From c4deaa7426ec213e1057faecf9bea8aad90eb4b1 Mon Sep 17 00:00:00 2001 From: AlongWY Date: Sat, 27 Jan 2024 05:20:08 +0000 Subject: [PATCH] deploy: 72066be21ad467c8ffc76b74c152b38decf3f0ac --- .nojekyll | 0 cache.json | 1 + favicon.ico | Bin 0 -> 15086 bytes index.css | 355 + index.html | 70843 ++++++++++++++++++++++++++++++++++++++++++++++++++ index.js | 39 + 6 files changed, 71238 insertions(+) create mode 100644 .nojekyll create mode 100644 cache.json create mode 100644 favicon.ico create mode 100644 index.css create mode 100644 index.html create mode 100644 index.js diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/cache.json b/cache.json new file mode 100644 index 00000000..bce99211 --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2024-01-19T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2401.10882v1","updated":"2024-01-19T18:49:36Z","published":"2024-01-19T18:49:36Z","title":"Reinforcement learning for question answering in programming domain\n using public community scoring as a human feedback","summary":" In this study, we investigate the enhancement of the GPT Neo 125M performance\nin Community Question Answering (CQA) with a focus on programming, through the\nintegration of Reinforcement Learning from Human Feedback (RLHF) and the\nutilization of scores from Stack Overflow. Two distinct reward model training\nstrategies are employed for fine-tuning with Proximal Policy Optimization\n(PPO). Notably, the improvements in performance achieved through this method\nare comparable to those of GPT Neo 2.7B parameter variant. Additionally, an\nauxiliary scoring mechanism is introduced, which demonstrates the limitations\nof conventional linguistic metrics in evaluating responses in the programming\ndomain. Through accurate analysis, this paper looks at the divergence between\ntraditional linguistic metrics and our human-preferences-based reward model,\nunderscoring the imperative for domain-specific evaluation methods. By\nelucidating the complexities involved in applying RLHF to programming CQA and\naccentuating the significance of context-aware evaluation, this study\ncontributes to the ongoing efforts in refining Large Language Models through\nfocused human feedback.\n","authors":["Alexey Gorbatovski","Sergey Kovalchuk"],"pdf_url":"https://arxiv.org/pdf/2401.10882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10862v1","updated":"2024-01-19T18:05:34Z","published":"2024-01-19T18:05:34Z","title":"Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs\n Without Fine-Tuning","summary":" Large Language Models (LLMs) are vulnerable to `Jailbreaking' prompts, a type\nof attack that can coax these models into generating harmful and illegal\ncontent. In this paper, we show that pruning up to 20% of LLM parameters\nmarkedly increases their resistance to such attacks without additional training\nand without sacrificing their performance in standard benchmarks. Intriguingly,\nwe discovered that the enhanced safety observed post-pruning correlates to the\ninitial safety training level of the model, hinting that the effect of pruning\ncould be more general and may hold for other LLM behaviors beyond safety.\nAdditionally, we introduce a curated dataset of 225 harmful tasks across five\ncategories, inserted into ten different Jailbreaking prompts, showing that\npruning aids LLMs in concentrating attention on task-relevant tokens in\njailbreaking prompts. 
Lastly, our experiments reveal that the prominent chat\nmodels, such as LLaMA-2 Chat, Vicuna, and Mistral Instruct exhibit high\nsusceptibility to jailbreaking attacks, with some categories achieving nearly\n70-100% success rate. These insights underline the potential of pruning as a\ngeneralizable approach for improving LLM safety, reliability, and potentially\nother desired behaviors.\n","authors":["Adib Hasan","Ileana Rugina","Alex Wang"],"pdf_url":"https://arxiv.org/pdf/2401.10862v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10850v1","updated":"2024-01-19T17:51:11Z","published":"2024-01-19T17:51:11Z","title":"Advancements in eHealth Data Analytics through Natural Language\n Processing and Deep Learning","summary":" The healthcare environment is commonly referred to as \"information-rich\" but\nalso \"knowledge poor\". Healthcare systems collect huge amounts of data from\nvarious sources: lab reports, medical letters, logs of medical tools or\nprograms, medical prescriptions, etc. These massive sets of data can provide\ngreat knowledge and information that can improve the medical services, and\noverall the healthcare domain, such as disease prediction by analyzing the\npatient's symptoms or disease prevention, by facilitating the discovery of\nbehavioral factors for diseases. Unfortunately, only a relatively small volume\nof the textual eHealth data is processed and interpreted, an important factor\nbeing the difficulty in efficiently performing Big Data operations. In the\nmedical field, detecting domain-specific multi-word terms is a crucial task as\nthey can define an entire concept with a few words. A term can be defined as a\nlinguistic structure or a concept, and it is composed of one or more words with\na specific meaning to a domain. All the terms of a domain create its\nterminology. This chapter offers a critical study of the current, most\nperformant solutions for analyzing unstructured (image and textual) eHealth\ndata. This study also provides a comparison of the current Natural Language\nProcessing and Deep Learning techniques in the eHealth context. Finally, we\nexamine and discuss some of the current issues, and we define a set of research\ndirections in this area.\n","authors":["Elena-Simona Apostol","Ciprian-Octavian Truică"],"pdf_url":"https://arxiv.org/pdf/2401.10850v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10841v1","updated":"2024-01-19T17:40:50Z","published":"2024-01-19T17:40:50Z","title":"Using LLMs to discover emerging coded antisemitic hate-speech emergence\n in extremist social media","summary":" Online hate speech proliferation has created a difficult problem for social\nmedia platforms. A particular challenge relates to the use of coded language by\ngroups interested in both creating a sense of belonging for its users and\nevading detection. Coded language evolves quickly and its use varies over time.\nThis paper proposes a methodology for detecting emerging coded hate-laden\nterminology. The methodology is tested in the context of online antisemitic\ndiscourse. The approach considers posts scraped from social media platforms,\noften used by extremist users. The posts are scraped using seed expressions\nrelated to previously known discourse of hatred towards Jews. The method begins\nby identifying the expressions most representative of each post and calculating\ntheir frequency in the whole corpus. 
It filters out grammatically incoherent\nexpressions as well as previously encountered ones so as to focus on emergent\nwell-formed terminology. This is followed by an assessment of semantic\nsimilarity to known antisemitic terminology using a fine-tuned large language\nmodel, and subsequent filtering out of the expressions that are too distant\nfrom known expressions of hatred. Emergent antisemitic expressions containing\nterms clearly relating to Jewish topics are then removed to return only coded\nexpressions of hatred.\n","authors":["Dhanush Kikkisetti","Raza Ul Mustafa","Wendy Melillo","Roberto Corizzo","Zois Boukouvalas","Jeff Gill","Nathalie Japkowicz"],"pdf_url":"https://arxiv.org/pdf/2401.10841v1.pdf","comment":"9 pages, 4 figures, 2 algorithms, 3 tables"},{"id":"http://arxiv.org/abs/2309.14393v2","updated":"2024-01-19T17:33:44Z","published":"2023-09-25T14:50:04Z","title":"LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language\n Models","summary":" The carbon footprint associated with large language models (LLMs) is a\nsignificant concern, encompassing emissions from their training, inference,\nexperimentation, and storage processes, including operational and embodied\ncarbon emissions. An essential aspect is accurately estimating the carbon\nimpact of emerging LLMs even before their training, which heavily relies on GPU\nusage. Existing studies have reported the carbon footprint of LLM training, but\nonly one tool, mlco2, can predict the carbon footprint of new neural networks\nprior to physical training. However, mlco2 has several serious limitations. It\ncannot extend its estimation to dense or mixture-of-experts (MoE) LLMs,\ndisregards critical architectural parameters, focuses solely on GPUs, and\ncannot model embodied carbon footprints. Addressing these gaps, we introduce\n\\textit{\\carb}, an end-to-end carbon footprint projection model designed for\nboth dense and MoE LLMs. Compared to mlco2, \\carb~significantly enhances the\naccuracy of carbon footprint estimations for various LLMs. The source code is\nreleased at \\url{https://github.com/SotaroKaneda/MLCarbon}.\n","authors":["Ahmad Faiz","Sotaro Kaneda","Ruhan Wang","Rita Osi","Prateek Sharma","Fan Chen","Lei Jiang"],"pdf_url":"https://arxiv.org/pdf/2309.14393v2.pdf","comment":"15 pages, 8 figures"},{"id":"http://arxiv.org/abs/2401.10825v1","updated":"2024-01-19T17:21:05Z","published":"2024-01-19T17:21:05Z","title":"A survey on recent advances in named entity recognition","summary":" Named Entity Recognition seeks to extract substrings within a text that name\nreal-world objects and to determine their type (for example, whether they refer\nto persons or organizations). In this survey, we first present an overview of\nrecent popular approaches, but we also look at graph- and transformer- based\nmethods including Large Language Models (LLMs) that have not had much coverage\nin other surveys. Second, we focus on methods designed for datasets with scarce\nannotations. Third, we evaluate the performance of the main NER implementations\non a variety of datasets with differing characteristics (as regards their\ndomain, their size, and their number of classes). We thus provide a deep\ncomparison of algorithms that are never considered together. 
Our experiments\nshed some light on how the characteristics of datasets affect the behavior of\nthe methods that we compare.\n","authors":["Imed Keraghel","Stanislas Morbieu","Mohamed Nadif"],"pdf_url":"https://arxiv.org/pdf/2401.10825v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2401.05273v2","updated":"2024-01-19T16:57:30Z","published":"2024-01-10T17:13:28Z","title":"INACIA: Integrating Large Language Models in Brazilian Audit Courts:\n Opportunities and Challenges","summary":" This paper introduces INACIA (Instru\\c{c}\\~ao Assistida com Intelig\\^encia\nArtificial), a groundbreaking system designed to integrate Large Language\nModels (LLMs) into the operational framework of Brazilian Federal Court of\nAccounts (TCU). The system automates various stages of case analysis, including\nbasic information extraction, admissibility examination, Periculum in mora and\nFumus boni iuris analyses, and recommendations generation. Through a series of\nexperiments, we demonstrate INACIA's potential in extracting relevant\ninformation from case documents, evaluating its legal plausibility, and\nformulating propositions for judicial decision-making. Utilizing a validation\ndataset alongside LLMs, our evaluation methodology presents an innovative\napproach to assessing system performance, correlating highly with human\njudgment. The results highlight INACIA's proficiency in handling complex legal\ntasks, indicating its suitability for augmenting efficiency and judicial\nfairness within legal systems. The paper also discusses potential enhancements\nand future applications, positioning INACIA as a model for worldwide AI\nintegration in legal domains.\n","authors":["Jayr Pereira","Andre Assumpcao","Julio Trecenti","Luiz Airosa","Caio Lente","Jhonatan Cléto","Guilherme Dobins","Rodrigo Nogueira","Luis Mitchell","Roberto Lotufo"],"pdf_url":"https://arxiv.org/pdf/2401.05273v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.08565v2","updated":"2024-01-19T16:48:59Z","published":"2023-09-15T17:33:24Z","title":"How Transferable are Attribute Controllers on Pretrained Multilingual\n Translation Models?","summary":" Customizing machine translation models to comply with fine-grained attributes\nsuch as formality has seen tremendous progress recently. However, current\napproaches mostly rely on at least some supervised data with attribute\nannotation. Data scarcity therefore remains a bottleneck to democratizing such\ncustomization possibilities to a wider range of languages, lower-resource ones\nin particular. Given recent progress in pretrained massively multilingual\ntranslation models, we use them as a foundation to transfer the attribute\ncontrolling capabilities to languages without supervised data. In this work, we\npresent a comprehensive analysis of transferring attribute controllers based on\na pretrained NLLB-200 model. We investigate both training- and inference-time\ncontrol techniques under various data scenarios, and uncover their relative\nstrengths and weaknesses in zero-shot performance and domain robustness. We\nshow that both paradigms are complementary, as shown by consistent improvements\non 5 zero-shot directions. Moreover, a human evaluation on a real low-resource\nlanguage, Bengali, confirms our findings on zero-shot transfer to new target\nlanguages. 
The code is\n$\\href{https://github.com/dannigt/attribute-controller-transfer}{\\text{here}}$.\n","authors":["Danni Liu","Jan Niehues"],"pdf_url":"https://arxiv.org/pdf/2309.08565v2.pdf","comment":"EACL 2024"},{"id":"http://arxiv.org/abs/2302.12190v2","updated":"2024-01-19T16:30:14Z","published":"2023-02-23T17:31:40Z","title":"MCWDST: a Minimum-Cost Weighted Directed Spanning Tree Algorithm for\n Real-Time Fake News Mitigation in Social Media","summary":" The widespread availability of internet access and handheld devices confers\nto social media a power similar to the one newspapers used to have. People seek\naffordable information on social media and can reach it within seconds. Yet\nthis convenience comes with dangers; any user may freely post whatever they\nplease and the content can stay online for a long period, regardless of its\ntruthfulness. A need to detect untruthful information, also known as fake news,\narises. In this paper, we present an end-to-end solution that accurately\ndetects fake news and immunizes network nodes that spread them in real-time. To\ndetect fake news, we propose two new stack deep learning architectures that\nutilize convolutional and bidirectional LSTM layers. To mitigate the spread of\nfake news, we propose a real-time network-aware strategy that (1) constructs a\nminimum-cost weighted directed spanning tree for a detected node, and (2)\nimmunizes nodes in that tree by scoring their harmfulness using a novel ranking\nfunction. We demonstrate the effectiveness of our solution on five real-world\ndatasets.\n","authors":["Ciprian-Octavian Truică","Elena-Simona Apostol","Radu-Cătălin Nicolescu","Panagiotis Karras"],"pdf_url":"https://arxiv.org/pdf/2302.12190v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.07107v3","updated":"2024-01-19T16:01:28Z","published":"2023-08-14T12:47:22Z","title":"Large Language Models for Information Retrieval: A Survey","summary":" As a primary means of information acquisition, information retrieval (IR)\nsystems, such as search engines, have integrated themselves into our daily\nlives. These systems also serve as components of dialogue, question-answering,\nand recommender systems. The trajectory of IR has evolved dynamically from its\norigins in term-based methods to its integration with advanced neural models.\nWhile the neural models excel at capturing complex contextual signals and\nsemantic nuances, thereby reshaping the IR landscape, they still face\nchallenges such as data scarcity, interpretability, and the generation of\ncontextually plausible yet potentially inaccurate responses. This evolution\nrequires a combination of both traditional methods (such as term-based sparse\nretrieval methods with rapid response) and modern neural architectures (such as\nlanguage models with powerful language understanding capacity). Meanwhile, the\nemergence of large language models (LLMs), typified by ChatGPT and GPT-4, has\nrevolutionized natural language processing due to their remarkable language\nunderstanding, generation, generalization, and reasoning abilities.\nConsequently, recent research has sought to leverage LLMs to improve IR\nsystems. Given the rapid evolution of this research trajectory, it is necessary\nto consolidate existing methodologies and provide nuanced insights through a\ncomprehensive overview. In this survey, we delve into the confluence of LLMs\nand IR systems, including crucial aspects such as query rewriters, retrievers,\nrerankers, and readers. 
Additionally, we explore promising directions, such as\nsearch agents, within this expanding field.\n","authors":["Yutao Zhu","Huaying Yuan","Shuting Wang","Jiongnan Liu","Wenhan Liu","Chenlong Deng","Haonan Chen","Zhicheng Dou","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2308.07107v3.pdf","comment":"updated to version 2"},{"id":"http://arxiv.org/abs/2401.10774v1","updated":"2024-01-19T15:48:40Z","published":"2024-01-19T15:48:40Z","title":"Medusa: Simple LLM Inference Acceleration Framework with Multiple\n Decoding Heads","summary":" The inference process in Large Language Models (LLMs) is often limited due to\nthe absence of parallelism in the auto-regressive decoding process, resulting\nin most operations being restricted by the memory bandwidth of accelerators.\nWhile methods such as speculative decoding have been suggested to address this\nissue, their implementation is impeded by the challenges associated with\nacquiring and maintaining a separate draft model. In this paper, we present\nMedusa, an efficient method that augments LLM inference by adding extra\ndecoding heads to predict multiple subsequent tokens in parallel. Using a\ntree-based attention mechanism, Medusa constructs multiple candidate\ncontinuations and verifies them simultaneously in each decoding step. By\nleveraging parallel processing, Medusa introduces only minimal overhead in\nterms of single-step latency while substantially reducing the number of\ndecoding steps required.\n We present two levels of fine-tuning procedures for Medusa to meet the needs\nof different use cases: Medusa-1: Medusa is directly fine-tuned on top of a\nfrozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa\nis fine-tuned together with the backbone LLM, enabling better prediction\naccuracy of Medusa heads and higher speedup but needing a special training\nrecipe that preserves the backbone model's capabilities.\n Moreover, we propose several extensions that improve or expand the utility of\nMedusa, including a self-distillation to handle situations where no training\ndata is available and a typical acceptance scheme to boost the acceptance rate\nwhile maintaining generation quality. We evaluate Medusa on models of various\nsizes and training procedures. Our experiments demonstrate that Medusa-1 can\nachieve over 2.2x speedup without compromising generation quality, while\nMedusa-2 further improves the speedup to 2.3-3.6x.\n","authors":["Tianle Cai","Yuhong Li","Zhengyang Geng","Hongwu Peng","Jason D. Lee","Deming Chen","Tri Dao"],"pdf_url":"https://arxiv.org/pdf/2401.10774v1.pdf","comment":"The code for this implementation is available at\n https://github.com/FasterDecoding/Medusa"},{"id":"http://arxiv.org/abs/2401.10768v1","updated":"2024-01-19T15:39:49Z","published":"2024-01-19T15:39:49Z","title":"Mitigating Hallucinations of Large Language Models via Knowledge\n Consistent Alignment","summary":" While Large Language Models (LLMs) have proven to be exceptional on a variety\nof tasks after alignment, they may still produce responses that contradict the\ncontext or world knowledge confidently, a phenomenon known as\n``hallucination''. In this paper, we demonstrate that reducing the\ninconsistency between the external knowledge encapsulated in the training data\nand the intrinsic knowledge inherited in the pretraining corpus could mitigate\nhallucination in alignment. 
Specifically, we introduce a novel knowledge\nconsistent alignment (KCA) approach, which involves automatically formulating\nexaminations based on external knowledge for accessing the comprehension of\nLLMs. For data encompassing knowledge inconsistency, KCA implements several\nsimple yet efficient strategies for processing. We illustrate the superior\nperformance of the proposed KCA approach in mitigating hallucinations across\nsix benchmarks using LLMs of different backbones and scales. Furthermore, we\nconfirm the correlation between knowledge inconsistency and hallucination,\nsignifying the effectiveness of reducing knowledge inconsistency in alleviating\nhallucinations. Our code, model weights, and data are public at\n\\url{https://github.com/fanqiwan/KCA}.\n","authors":["Fanqi Wan","Xinting Huang","Leyang Cui","Xiaojun Quan","Wei Bi","Shuming Shi"],"pdf_url":"https://arxiv.org/pdf/2401.10768v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2306.16143v4","updated":"2024-01-19T15:05:14Z","published":"2023-06-28T12:17:45Z","title":"Generative User-Experience Research for Developing Domain-specific\n Natural Language Processing Applications","summary":" User experience (UX) is a part of human-computer interaction (HCI) research\nand focuses on increasing intuitiveness, transparency, simplicity, and trust\nfor the system users. Most UX research for machine learning (ML) or natural\nlanguage processing (NLP) focuses on a data-driven methodology. It engages\ndomain users mainly for usability evaluation. Moreover, more typical UX methods\ntailor the systems towards user usability, unlike learning about the user needs\nfirst. This paper proposes a new methodology for integrating generative UX\nresearch into developing domain NLP applications. Generative UX research\nemploys domain users at the initial stages of prototype development, i.e.,\nideation and concept evaluation, and the last stage for evaluating system\nusefulness and user utility. The methodology emerged from and is evaluated on a\ncase study about the full-cycle prototype development of a domain-specific\nsemantic search for daily operations in the process industry. A key finding of\nour case study is that involving domain experts increases their interest and\ntrust in the final NLP application. The combined UX+NLP research of the\nproposed method efficiently considers data- and user-driven opportunities and\nconstraints, which can be crucial for developing NLP applications.\n","authors":["Anastasia Zhukova","Lukas von Sperl","Christian E. Matt","Bela Gipp"],"pdf_url":"https://arxiv.org/pdf/2306.16143v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10716v1","updated":"2024-01-19T14:27:44Z","published":"2024-01-19T14:27:44Z","title":"Structured Code Representations Enable Data-Efficient Adaptation of Code\n Language Models","summary":" Current language models tailored for code tasks often adopt the\npre-training-then-fine-tuning paradigm from natural language processing,\nmodeling source code as plain text. This approach, however, overlooks the\nunambiguous structures inherent in programming languages. In this work, we\nexplore data-efficient adaptation of pre-trained code models by further\npre-training and fine-tuning them with program structures. Specifically, we\nrepresent programs as parse trees -- also known as concrete syntax trees (CSTs)\n-- and adapt pre-trained models on serialized CSTs. 
Although the models that we\nadapt have been pre-trained only on the surface form of programs, we find that\na small amount of continual pre-training and fine-tuning on CSTs without\nchanging the model architecture yields improvements over the baseline approach\nacross various code tasks. The improvements are found to be particularly\nsignificant when there are limited training examples, demonstrating the\neffectiveness of integrating program structures with plain-text representation\neven when working with backbone models that have not been pre-trained with\nstructures.\n","authors":["Mayank Agarwal","Yikang Shen","Bailin Wang","Yoon Kim","Jie Chen"],"pdf_url":"https://arxiv.org/pdf/2401.10716v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10712v1","updated":"2024-01-19T14:22:29Z","published":"2024-01-19T14:22:29Z","title":"Q&A Prompts: Discovering Rich Visual Clues through Mining\n Question-Answer Prompts for VQA requiring Diverse World Knowledge","summary":" With the breakthrough of multi-modal large language models, answering complex\nvisual questions that demand advanced reasoning abilities and world knowledge\nhas become a much more important testbed for developing AI models than ever.\nHowever, equipping AI models with robust cross-modality reasoning ability\nremains challenging since the cognition scheme of humans has not been\nunderstood systematically. In this paper, we believe that if we can collect\nvisual clues in the given image as much as possible, we will recognize the\nimage more accurately, understand the question better, recall relevant\nknowledge more easily, and finally reason out the answer. We discover these\nrich visual clues by mining question-answer pairs in images and sending them\ninto multi-modal large language models as prompts. We call the proposed method\nQ&A Prompts. Specifically, we first use the image-answer pairs and the\ncorresponding questions in the training set as inputs and outputs to train a\nvisual question generation model. Then, we use an image tagging model to\nidentify various instances and send packaged image-tag pairs into the visual\nquestion generation model to generate relevant questions with the extracted\nimage tags as answers. Finally, we encode these generated question-answer pairs\nas prompts with a visual-aware prompting module and send them into pre-trained\nmulti-modal large language models to reason out the final answers. Experimental\nresults show that, compared with state-of-the-art methods, our Q&A Prompts\nachieves substantial improvements on the challenging visual question answering\ndatasets requiring reasoning over diverse world knowledge, such as OK-VQA and\nA-OKVQA.\n","authors":["Haibi Wang","Weifeng Ge"],"pdf_url":"https://arxiv.org/pdf/2401.10712v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10711v1","updated":"2024-01-19T14:21:46Z","published":"2024-01-19T14:21:46Z","title":"Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal\n Models for Video Question Answering","summary":" Video Question Answering (VideoQA) aims to answer natural language questions\nbased on the information observed in videos. Despite the recent success of\nLarge Multimodal Models (LMMs) in image-language understanding and reasoning,\nthey deal with VideoQA insufficiently by simply taking uniformly sampled frames\nas visual inputs, which ignores question-relevant visual clues. Moreover, there\nare no human annotations for question-critical timestamps in existing VideoQA\ndatasets. 
In light of this, we propose a novel weakly supervised framework to\nenforce the LMMs to reason out the answers with question-critical moments as\nvisual inputs. Specifically, we fuse the question and answer pairs as event\ndescriptions to find multiple keyframes as target moments, which will be\npseudo-labels. With these pseudo-labels as additionally weak supervision, we\ndevise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG\nlearns multiple Gaussian functions to characterize the temporal structure of\nthe video, and sample question-critical frames as positive moments to be the\nvisual inputs of LMMs. Extensive experiments on several VideoQA benchmarks\nverify the effectiveness of our framework, and we achieve substantial\nimprovements compared to previous state-of-the-art methods.\n","authors":["Haibo Wang","Chenghang Lai","Yixuan Sun","Weifeng Ge"],"pdf_url":"https://arxiv.org/pdf/2401.10711v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10695v1","updated":"2024-01-19T14:00:19Z","published":"2024-01-19T14:00:19Z","title":"LangBridge: Multilingual Reasoning Without Multilingual Supervision","summary":" We introduce LangBridge, a zero-shot approach to adapt language models for\nmultilingual reasoning tasks without multilingual supervision. LangBridge\noperates by bridging two models, each specialized in different aspects: (1) one\nspecialized in understanding multiple languages (e.g., mT5 encoder) and (2) one\nspecialized in reasoning (e.g., Orca 2). LangBridge connects the two models by\nintroducing minimal trainable parameters between them. Despite utilizing only\nEnglish data for training, LangBridge considerably enhances the performance of\nlanguage models on low-resource languages across mathematical reasoning,\ncoding, and logical reasoning. Our analysis suggests that the efficacy of\nLangBridge stems from the language-agnostic characteristics of multilingual\nrepresentations. We publicly release our code and models.\n","authors":["Dongkeun Yoon","Joel Jang","Sungdong Kim","Seungone Kim","Sheikh Shafayat","Minjoon Seo"],"pdf_url":"https://arxiv.org/pdf/2401.10695v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2401.09343v2","updated":"2024-01-19T13:33:22Z","published":"2024-01-17T17:08:36Z","title":"Efficient slot labelling","summary":" Slot labelling is an essential component of any dialogue system, aiming to\nfind important arguments in every user turn. Common approaches involve large\npre-trained language models (PLMs) like BERT or RoBERTa, but they face\nchallenges such as high computational requirements and dependence on\npre-training data. In this work, we propose a lightweight method which performs\non par or better than the state-of-the-art PLM-based methods, while having\nalmost 10x less trainable parameters. This makes it especially applicable for\nreal-life industry scenarios.\n","authors":["Vladimir Vlasov"],"pdf_url":"https://arxiv.org/pdf/2401.09343v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.10444v3","updated":"2024-01-19T13:19:13Z","published":"2023-09-19T09:04:15Z","title":"Exploring Iterative Enhancement for Improving Learnersourced\n Multiple-Choice Question Explanations with Large Language Models","summary":" Large language models exhibit superior capabilities in processing and\nunderstanding language, yet their applications in educational contexts remain\nunderexplored. Learnersourcing enhances learning by engaging students in\ncreating their own educational content. 
When learnersourcing multiple-choice\nquestions, creating explanations for the solution of a question is a crucial\nstep; it helps other students understand the solution and promotes a deeper\nunderstanding of related concepts. However, it is often difficult for students\nto craft effective solution explanations, due to limited subject understanding.\nTo help scaffold the task of automated explanation generation, we present and\nevaluate a framework called \"ILearner-LLM\", that iteratively enhances the\ngenerated explanations for the given questions with large language models.\nComprising an explanation generation model and an explanation evaluation model,\nthe framework generates high-quality student-aligned explanations by\niteratively feeding the quality rating score from the evaluation model back\ninto the instruction prompt of the explanation generation model. Experimental\nresults demonstrate the effectiveness of our ILearner-LLM on LLaMA2-13B and\nGPT-4 to generate higher quality explanations that are closer to those written\nby students on five PeerWise datasets. Our findings represent a promising path\nto enrich the learnersourcing experience for students and to enhance the\ncapabilities of large language models for educational applications.\n","authors":["Qiming Bao","Juho Leinonen","Alex Yuxuan Peng","Wanjun Zhong","Gaël Gendron","Timothy Pistotti","Alice Huang","Paul Denny","Michael Witbrock","Jiamou Liu"],"pdf_url":"https://arxiv.org/pdf/2309.10444v3.pdf","comment":"Preprint. Under review"},{"id":"http://arxiv.org/abs/2306.00168v3","updated":"2024-01-19T13:05:04Z","published":"2023-05-31T20:25:08Z","title":"Measuring the Robustness of NLP Models to Domain Shifts","summary":" Existing research on Domain Robustness (DR) suffers from disparate setups,\nlack of task variety, and scarce research on recent models and capabilities\nsuch as few-shot learning. Furthermore, we claim that the common practice of\nmeasuring DR might further obscure the picture. Current research focuses on\nchallenge sets and relies solely on the Source Drop (SD): Using the source\nin-domain performance as a reference point for degradation. However, the Target\nDrop (TD) should be used as a complementary point of view. To understand the DR\nchallenge in modern NLP models, we developed a benchmark comprised of seven NLP\ntasks, including classification, QA, and generation. Our benchmark focuses on\nnatural topical domain shifts and enables measuring both the SD and the TD. Our\ncomprehensive study, involving over 14,000 domain shifts across 18 fine-tuned\nand few-shot models, shows that both models suffer from drops upon domain\nshifts. While fine-tuned models excel in-domain, few-shot LLMs often surpass\nthem cross-domain, showing better robustness. In addition, we found that a\nlarge SD can be explained by shifting to a harder domain rather than a genuine\nDR challenge. 
Thus, the TD is a more reliable metric.\n","authors":["Nitay Calderon","Naveh Porat","Eyal Ben-David","Alexander Chapanin","Zorik Gekhman","Nadav Oved","Vitaly Shalumov","Roi Reichart"],"pdf_url":"https://arxiv.org/pdf/2306.00168v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.01185v2","updated":"2024-01-19T12:34:07Z","published":"2023-12-02T17:24:17Z","title":"A ripple in time: a discontinuity in American history","summary":" In this note we use the State of the Union Address (SOTU) dataset from Kaggle\nto make some surprising (and some not so surprising) observations pertaining to\nthe general timeline of American history, and the character and nature of the\naddresses themselves. Our main approach is using vector embeddings, such as\nBERT (DistilBERT) and GPT-2.\n While it is widely believed that BERT (and its variations) is most suitable\nfor NLP classification tasks, we find out that GPT-2 in conjunction with\nnonlinear dimension reduction methods such as UMAP provide better separation\nand stronger clustering. This makes GPT-2 + UMAP an interesting alternative. In\nour case, no model fine-tuning is required, and the pre-trained out-of-the-box\nGPT-2 model is enough.\n We also used a fine-tuned DistilBERT model for classification detecting which\nPresident delivered which address, with very good results (accuracy 93\\% - 95\\%\ndepending on the run). An analogous task was performed to determine the year of\nwriting, and we were able to pin it down to about 4 years (which is a single\npresidential term).\n It is worth noting that SOTU addresses provide relatively small writing\nsamples (with about 8000 words on average, and varying widely from under 2000\nwords to more than 20000), and that the amount of authors is relatively large\n(we used SOTU addresses of 42 US presidents). This shows that the techniques\nemployed turn out to be rather efficient, while all the computations described\nin this note can be performed using a single GPU instance of Google Colab.\n The accompanying code is available on GitHub.\n","authors":["Alexander Kolpakov","Igor Rivin"],"pdf_url":"https://arxiv.org/pdf/2312.01185v2.pdf","comment":"7 pages, 8 figures; GitHub repository\n https://github.com/sashakolpakov/ripple_in_time"},{"id":"http://arxiv.org/abs/2401.10660v1","updated":"2024-01-19T12:26:57Z","published":"2024-01-19T12:26:57Z","title":"A Simple Framework to Accelerate Multilingual Language Model for\n Monolingual Text Generation","summary":" Recent advancements in large language models have facilitated the execution\nof complex language tasks, not only in English but also in non-English\nlanguages. However, the tokenizers of most language models, such as Llama,\ntrained on English-centric corpora, tend to excessively fragment tokens in\nnon-English languages. This issue is especially pronounced in non-roman\nalphabetic languages, which are often divided at a character or even Unicode\nlevel, leading to slower text generation. To address this, our study introduces\na novel framework designed to expedite text generation in these languages. This\nframework predicts larger linguistic units than those of conventional\nmultilingual tokenizers and is specifically tailored to the target language,\nthereby reducing the number of decoding steps required. 
Our empirical results\ndemonstrate that the proposed framework increases the generation speed by a\nfactor of 1.9 compared to standard decoding while maintaining the performance\nof a pre-trained multilingual model on monolingual tasks.\n","authors":["Jimin Hong","Gibbeum Lee","Jaewoong Cho"],"pdf_url":"https://arxiv.org/pdf/2401.10660v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10653v1","updated":"2024-01-19T11:59:13Z","published":"2024-01-19T11:59:13Z","title":"Attentive Fusion: A Transformer-based Approach to Multimodal Hate Speech\n Detection","summary":" With the recent surge and exponential growth of social media usage,\nscrutinizing social media content for the presence of any hateful content is of\nutmost importance. Researchers have been diligently working since the past\ndecade on distinguishing between content that promotes hatred and content that\ndoes not. Traditionally, the main focus has been on analyzing textual content.\nHowever, recent research attempts have also commenced into the identification\nof audio-based content. Nevertheless, studies have shown that relying solely on\naudio or text-based content may be ineffective, as recent upsurge indicates\nthat individuals often employ sarcasm in their speech and writing. To overcome\nthese challenges, we present an approach to identify whether a speech promotes\nhate or not utilizing both audio and textual representations. Our methodology\nis based on the Transformer framework that incorporates both audio and text\nsampling, accompanied by our very own layer called \"Attentive Fusion\". The\nresults of our study surpassed previous state-of-the-art techniques, achieving\nan impressive macro F1 score of 0.927 on the Test Set.\n","authors":["Atanu Mandal","Gargi Roy","Amit Barman","Indranil Dutta","Sudip Kumar Naskar"],"pdf_url":"https://arxiv.org/pdf/2401.10653v1.pdf","comment":"Accepted in 20th International Conference on Natural Language\n Processing (ICON)"},{"id":"http://arxiv.org/abs/2401.10647v1","updated":"2024-01-19T11:48:09Z","published":"2024-01-19T11:48:09Z","title":"Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language\n Models","summary":" In the rapidly advancing field of artificial intelligence, the concept of\nRed-Teaming or Jailbreaking large language models (LLMs) has emerged as a\ncrucial area of study. This approach is especially significant in terms of\nassessing and enhancing the safety and robustness of these models. This paper\ninvestigates the intricate consequences of such modifications through model\nediting, uncovering a complex relationship between enhancing model accuracy and\npreserving its ethical integrity. Our in-depth analysis reveals a striking\nparadox: while injecting accurate information is crucial for model reliability,\nit can paradoxically destabilize the model's foundational framework, resulting\nin unpredictable and potentially unsafe behaviors. Additionally, we propose a\nbenchmark dataset NicheHazardQA to investigate this unsafe behavior both within\nthe same and cross topical domain. This aspect of our research sheds light on\nhow the edits, impact the model's safety metrics and guardrails. 
Our findings\nshow that model editing serves as a cost-effective tool for topical red-teaming\nby methodically applying targeted edits and evaluating the resultant model\nbehavior\n","authors":["Rima Hazra","Sayan Layek","Somnath Banerjee","Soujanya Poria"],"pdf_url":"https://arxiv.org/pdf/2401.10647v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.13274v2","updated":"2024-01-19T10:06:50Z","published":"2023-11-22T09:51:53Z","title":"Enhancing Summarization Performance through Transformer-Based Prompt\n Engineering in Automated Medical Reporting","summary":" Customized medical prompts enable Large Language Models (LLM) to effectively\naddress medical dialogue summarization. The process of medical reporting is\noften time-consuming for healthcare professionals. Implementing medical\ndialogue summarization techniques presents a viable solution to alleviate this\ntime constraint by generating automated medical reports. The effectiveness of\nLLMs in this process is significantly influenced by the formulation of the\nprompt, which plays a crucial role in determining the quality and relevance of\nthe generated reports. In this research, we used a combination of two distinct\nprompting strategies, known as shot prompting and pattern prompting to enhance\nthe performance of automated medical reporting. The evaluation of the automated\nmedical reports is carried out using the ROUGE score and a human evaluation\nwith the help of an expert panel. The two-shot prompting approach in\ncombination with scope and domain context outperforms other methods and\nachieves the highest score when compared to the human reference set by a\ngeneral practitioner. However, the automated reports are approximately twice as\nlong as the human references, due to the addition of both redundant and\nrelevant statements that are added to the report.\n","authors":["Daphne van Zandvoort","Laura Wiersema","Tom Huibers","Sandra van Dulmen","Sjaak Brinkkemper"],"pdf_url":"https://arxiv.org/pdf/2311.13274v2.pdf","comment":"12 pages, 4 figures, to be presented at HEALTHINF 2024, author\n contributions: research conducted and written by Daphne van Zandvoort and\n Laura Wiersema, research suggested and used software created by Tom Huibers,\n data provided and feedback provided by Sandra van Dulmen, supervision and\n feedback provided by Sjaak Brinkkemper"},{"id":"http://arxiv.org/abs/2311.12399v3","updated":"2024-01-19T09:49:46Z","published":"2023-11-21T07:22:48Z","title":"A Survey of Graph Meets Large Language Model: Progress and Future\n Directions","summary":" Graph plays a significant role in representing and analyzing complex\nrelationships in real-world applications such as citation networks, social\nnetworks, and biological data. Recently, Large Language Models (LLMs), which\nhave achieved tremendous success in various domains, have also been leveraged\nin graph-related tasks to surpass traditional Graph Neural Networks (GNNs)\nbased methods and yield state-of-the-art performance. In this survey, we first\npresent a comprehensive review and analysis of existing methods that integrate\nLLMs with graphs. First of all, we propose a new taxonomy, which organizes\nexisting methods into three categories based on the role (i.e., enhancer,\npredictor, and alignment component) played by LLMs in graph-related tasks. Then\nwe systematically survey the representative methods along the three categories\nof the taxonomy. 
Finally, we discuss the remaining limitations of existing\nstudies and highlight promising avenues for future research. The relevant\npapers are summarized and will be consistently updated at:\nhttps://github.com/yhLeeee/Awesome-LLMs-in-Graph-tasks.\n","authors":["Yuhan Li","Zhixun Li","Peisong Wang","Jia Li","Xiangguo Sun","Hong Cheng","Jeffrey Xu Yu"],"pdf_url":"https://arxiv.org/pdf/2311.12399v3.pdf","comment":"Work in progress; 13 pages, 5 figures"},{"id":"http://arxiv.org/abs/2401.10580v1","updated":"2024-01-19T09:46:08Z","published":"2024-01-19T09:46:08Z","title":"PHOENIX: Open-Source Language Adaption for Direct Preference\n Optimization","summary":" Large language models have gained immense importance in recent years and have\ndemonstrated outstanding results in solving various tasks. However, despite\nthese achievements, many questions remain unanswered in the context of large\nlanguage models. Besides the optimal use of the models for inference and the\nalignment of the results to the desired specifications, the transfer of models\nto other languages is still an underdeveloped area of research. The recent\npublication of models such as Llama-2 and Zephyr has provided new insights into\narchitectural improvements and the use of human feedback. However, insights\ninto adapting these techniques to other languages remain scarce. In this paper,\nwe build on latest improvements and apply the Direct Preference\nOptimization(DPO) approach to the German language. The model is available at\nhttps://huggingface.co/DRXD1000/Phoenix.\n","authors":["Matthias Uhlig","Sigurd Schacht","Sudarshan Kamath Barkur"],"pdf_url":"https://arxiv.org/pdf/2401.10580v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10567v1","updated":"2024-01-19T09:13:28Z","published":"2024-01-19T09:13:28Z","title":"Self-training from Self-memory in Data-to-text Generation","summary":" This paper introduces a novel training model, self-training from self-memory\n(STSM) in data-to-text generation (DTG), allowing the model to self-train on\nsubsets, including self-memory as outputs inferred directly from the trained\nmodels and/or the new data. The quality of self-memory is validated by two\nmodels, data-to-text (D2T) and text-to-data (T2D), by two pre-defined\nconditions: (1) the appearance of all source values in the outputs of the D2T\nmodel and (2) the ability to convert back to source data in the outputs in the\nT2D model. We utilize a greedy algorithm to generate shorter D2T outputs if\nthey contain all source values. Subsequently, we use the T2D model to confirm\nthat these outputs can capture input relationships by demonstrating their\ncapacity to convert text back into data. With 30% of the dataset, we can train\nthe D2T model with a competitive performance compared to full training in the\nsame setup. We experiment with our model on two datasets, E2E NLG and DART.\nSTSM offers the D2T model a generalization capability from its subset memory\nwhile reducing training data volume. Ultimately, we anticipate that this paper\nwill contribute to continual learning solutions that adapt to new training\ndata, incorporating it as a form of self-memory in DTG tasks. 
The curated\ndataset is publicly available at: https://github.com/hoangthangta/STSM.\n","authors":["Hoang-Thang Ta"],"pdf_url":"https://arxiv.org/pdf/2401.10567v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2401.09566v2","updated":"2024-01-19T08:57:19Z","published":"2024-01-17T19:43:43Z","title":"Aligning Large Language Models with Counterfactual DPO","summary":" Advancements in large language models (LLMs) have demonstrated remarkable\ncapabilities across a diverse range of applications. These models excel in\ngenerating text completions that are contextually coherent and cover an\nextensive array of subjects. However, the vast datasets required for their\ntraining make aligning response styles during the pretraining and instruction\ntuning phases challenging. Consequently, an additional alignment phase is\ntypically employed, wherein the model is further trained with human preference\ndata to better align its outputs with human expectations. While this process\ndoesn't introduce new capabilities per se, it does accentuate generation styles\ninnate to the model. This paper explores the utilization of counterfactual\nprompting within the framework of Direct Preference Optimization (DPO) to align\nthe model's style without relying on human intervention. We demonstrate that\nthis method effectively instils desirable behaviour, mitigates undesirable\nones, and encourages the model to disregard inappropriate instructions. Our\nfindings suggest that counterfactual prompting with DPO presents a low-resource\nway to fine-tune LLMs to meet the demands for responsible and ethically aligned\nAI systems.\n","authors":["Bradley Butcher"],"pdf_url":"https://arxiv.org/pdf/2401.09566v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10559v1","updated":"2024-01-19T08:50:54Z","published":"2024-01-19T08:50:54Z","title":"OrchMoE: Efficient Multi-Adapter Learning with Task-Skill Synergy","summary":" We advance the field of Parameter-Efficient Fine-Tuning (PEFT) with our novel\nmulti-adapter method, OrchMoE, which capitalizes on modular skill architecture\nfor enhanced forward transfer in neural networks. Unlike prior models that\ndepend on explicit task identification inputs, OrchMoE automatically discerns\ntask categories, streamlining the learning process. This is achieved through an\nintegrated mechanism comprising an Automatic Task Classification module and a\nTask-Skill Allocation module, which collectively deduce task-specific\nclassifications and tailor skill allocation matrices. Our extensive evaluations\non the 'Super Natural Instructions' dataset, featuring 1,600 diverse\ninstructional tasks, indicate that OrchMoE substantially outperforms comparable\nmulti-adapter baselines in terms of both performance and sample utilization\nefficiency, all while operating within the same parameter constraints. 
These\nfindings suggest that OrchMoE offers a significant leap forward in multi-task\nlearning efficiency.\n","authors":["Haowen Wang","Tao Sun","Kaixiang Ji","Jian Wang","Cong Fan","Jinjie Gu"],"pdf_url":"https://arxiv.org/pdf/2401.10559v1.pdf","comment":"9 pages, 3 figures"},{"id":"http://arxiv.org/abs/2401.08326v2","updated":"2024-01-19T08:48:37Z","published":"2024-01-16T12:45:15Z","title":"RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large\n Language Models in Tool Learning","summary":" Tool learning has generated widespread interest as a vital means of\ninteraction between Large Language Models (LLMs) and the physical world.\nCurrent research predominantly emphasizes LLMs' capacity to utilize tools in\nwell-structured environments while overlooking their stability when confronted\nwith the inevitable noise of the real world. To bridge this gap, we introduce\nRoTBench, a multi-level benchmark for evaluating the robustness of LLMs in tool\nlearning. Specifically, we establish five external environments, each featuring\nvarying levels of noise (i.e., Clean, Slight, Medium, Heavy, and Union),\nproviding an in-depth analysis of the model's resilience across three critical\nphases: tool selection, parameter identification, and content filling.\nExperiments involving six widely-used models underscore the urgent necessity\nfor enhancing the robustness of LLMs in tool learning. For instance, the\nperformance of GPT-4 even drops significantly from 80.00 to 58.10 when there is\nno substantial change in manual accuracy. More surprisingly, the noise\ncorrection capability inherent in the GPT family paradoxically impedes its\nadaptability in the face of mild noise. In light of these findings, we propose\nRoTTuning, a strategy that enriches the diversity of training environments to\nbolster the robustness of LLMs in tool learning. The code and data are\navailable at https://github.com/Junjie-Ye/RoTBench.\n","authors":["Junjie Ye","Yilong Wu","Songyang Gao","Caishuang Huang","Sixian Li","Guanyu Li","Xiaoran Fan","Qi Zhang","Tao Gui","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2401.08326v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10543v1","updated":"2024-01-19T08:02:37Z","published":"2024-01-19T08:02:37Z","title":"Multilingual acoustic word embeddings for zero-resource languages","summary":" This research addresses the challenge of developing speech applications for\nzero-resource languages that lack labelled data. It specifically uses acoustic\nword embedding (AWE) -- fixed-dimensional representations of variable-duration\nspeech segments -- employing multilingual transfer, where labelled data from\nseveral well-resourced languages are used for pertaining. The study introduces\na new neural network that outperforms existing AWE models on zero-resource\nlanguages. 
It explores the impact of the choice of well-resourced languages.\nAWEs are applied to a keyword-spotting system for hate speech detection in\nSwahili radio broadcasts, demonstrating robustness in real-world scenarios.\nAdditionally, novel semantic AWE models improve semantic query-by-example\nsearch.\n","authors":["Christiaan Jacobs","Herman Kamper"],"pdf_url":"https://arxiv.org/pdf/2401.10543v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.14995v2","updated":"2024-01-19T07:47:01Z","published":"2023-07-27T16:45:33Z","title":"TransNormerLLM: A Faster and Better Large Language Model with Improved\n TransNormer","summary":" We present TransNormerLLM, the first linear attention-based Large Language\nModel (LLM) that outperforms conventional softmax attention-based models in\nterms of both accuracy and efficiency. TransNormerLLM evolves from the previous\nlinear attention architecture TransNormer by making advanced modifications that\ninclude positional embedding, linear attention acceleration, gating mechanisms,\ntensor normalization, and inference acceleration and stabilization.\nSpecifically, we use LRPE together with an exponential decay to avoid attention\ndilution issues while allowing the model to retain global interactions between\ntokens. Additionally, we propose Lightning Attention, a cutting-edge technique\nthat accelerates linear attention by more than twice in runtime and reduces\nmemory usage by a remarkable four times. To further enhance the performance of\nTransNormer, we leverage a gating mechanism for smooth training and a new\ntensor normalization scheme to accelerate the model, resulting in an impressive\nacceleration of over $20\\%$. Furthermore, we develop a robust inference\nalgorithm that ensures numerical stability and consistent inference speed,\nregardless of the sequence length, showcasing superior efficiency during both\ntraining and inference stages. We also implement an efficient model parallel\nschema for TransNormerLLM, enabling seamless deployment on large-scale clusters\nand facilitating expansion to even more extensive models, i.e., LLMs with 175B\nparameters. We validate our model design through a series of ablations and\ntrain models with sizes of 385M, 1B, and 7B on our self-collected corpus.\nBenchmark results demonstrate that our models not only match the performance of\nstate-of-the-art LLMs with Transformer but are also significantly faster. Code\nis released at: https://github.com/OpenNLPLab/TransnormerLLM.\n","authors":["Zhen Qin","Dong Li","Weigao Sun","Weixuan Sun","Xuyang Shen","Xiaodong Han","Yunshen Wei","Baohong Lv","Xiao Luo","Yu Qiao","Yiran Zhong"],"pdf_url":"https://arxiv.org/pdf/2307.14995v2.pdf","comment":"Technical Report. Yiran Zhong is the corresponding author. Zhen Qin,\n Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen contribute equally to this\n paper. Code is released at: https://github.com/OpenNLPLab/TransnormerLLM"},{"id":"http://arxiv.org/abs/2401.10536v1","updated":"2024-01-19T07:30:57Z","published":"2024-01-19T07:30:57Z","title":"Speech Swin-Transformer: Exploring a Hierarchical Transformer with\n Shifted Windows for Speech Emotion Recognition","summary":" Swin-Transformer has demonstrated remarkable success in computer vision by\nleveraging its hierarchical feature representation based on Transformer. In\nspeech signals, emotional information is distributed across different scales of\nspeech features, e.\\,g., word, phrase, and utterance. 
Drawing above\ninspiration, this paper presents a hierarchical speech Transformer with shifted\nwindows to aggregate multi-scale emotion features for speech emotion\nrecognition (SER), called Speech Swin-Transformer. Specifically, we first\ndivide the speech spectrogram into segment-level patches in the time domain,\ncomposed of multiple frame patches. These segment-level patches are then\nencoded using a stack of Swin blocks, in which a local window Transformer is\nutilized to explore local inter-frame emotional information across frame\npatches of each segment patch. After that, we also design a shifted window\nTransformer to compensate for patch correlations near the boundaries of segment\npatches. Finally, we employ a patch merging operation to aggregate\nsegment-level emotional features for hierarchical speech representation by\nexpanding the receptive field of Transformer from frame-level to segment-level.\nExperimental results demonstrate that our proposed Speech Swin-Transformer\noutperforms the state-of-the-art methods.\n","authors":["Yong Wang","Cheng Lu","Hailun Lian","Yan Zhao","Björn Schuller","Yuan Zong","Wenming Zheng"],"pdf_url":"https://arxiv.org/pdf/2401.10536v1.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.10535v1","updated":"2024-01-19T07:21:45Z","published":"2024-01-19T07:21:45Z","title":"The \"Colonial Impulse\" of Natural Language Processing: An Audit of\n Bengali Sentiment Analysis Tools and Their Identity-based Biases","summary":" While colonization has sociohistorically impacted people's identities across\nvarious dimensions, those colonial values and biases continue to be perpetuated\nby sociotechnical systems. One category of sociotechnical systems--sentiment\nanalysis tools--can also perpetuate colonial values and bias, yet less\nattention has been paid to how such tools may be complicit in perpetuating\ncoloniality, although they are often used to guide various practices (e.g.,\ncontent moderation). In this paper, we explore potential bias in sentiment\nanalysis tools in the context of Bengali communities that have experienced and\ncontinue to experience the impacts of colonialism. Drawing on identity\ncategories most impacted by colonialism amongst local Bengali communities, we\nfocused our analytic attention on gender, religion, and nationality. We\nconducted an algorithmic audit of all sentiment analysis tools for Bengali,\navailable on the Python package index (PyPI) and GitHub. Despite similar\nsemantic content and structure, our analyses showed that in addition to\ninconsistencies in output from different tools, Bengali sentiment analysis\ntools exhibit bias between different identity categories and respond\ndifferently to different ways of identity expression. Connecting our findings\nwith colonially shaped sociocultural structures of Bengali communities, we\ndiscuss the implications of downstream bias of sentiment analysis tools.\n","authors":["Dipto Das","Shion Guha","Jed Brubaker","Bryan Semaan"],"pdf_url":"https://arxiv.org/pdf/2401.10535v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10529v1","updated":"2024-01-19T07:10:13Z","published":"2024-01-19T07:10:13Z","title":"Mementos: A Comprehensive Benchmark for Multimodal Large Language Model\n Reasoning over Image Sequences","summary":" Multimodal Large Language Models (MLLMs) have demonstrated proficiency in\nhandling a variety of visual-language tasks. 
However, current MLLM benchmarks\nare predominantly designed to evaluate reasoning based on static information\nabout a single image, and the ability of modern MLLMs to extrapolate from image\nsequences, which is essential for understanding our ever-changing world, has\nbeen less investigated. To address this challenge, this paper introduces\nMementos, a new benchmark designed to assess MLLMs' sequential image reasoning\nabilities. Mementos features 4,761 diverse image sequences with varying\nlengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning\nperformance. Through a careful evaluation of nine recent MLLMs on Mementos,\nincluding GPT-4V and Gemini, we find that they struggle to accurately describe\ndynamic information about given image sequences, often leading to\nhallucinations/misrepresentations of objects and their corresponding behaviors.\nOur quantitative analysis and case studies identify three key factors impacting\nMLLMs' sequential image reasoning: the correlation between object and\nbehavioral hallucinations, the influence of cooccurring behaviors, and the\ncompounding impact of behavioral hallucinations. Our dataset is available at\nhttps://github.com/umd-huang-lab/Mementos.\n","authors":["Xiyao Wang","Yuhang Zhou","Xiaoyu Liu","Hongjin Lu","Yuancheng Xu","Feihong He","Jaehong Yoon","Taixi Lu","Gedas Bertasius","Mohit Bansal","Huaxiu Yao","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2401.10529v1.pdf","comment":"27 pages, 23 figures"},{"id":"http://arxiv.org/abs/2401.10521v1","updated":"2024-01-19T06:54:39Z","published":"2024-01-19T06:54:39Z","title":"Cross-lingual Editing in Multilingual Language Models","summary":" The training of large language models (LLMs) necessitates substantial data\nand computational resources, and updating outdated LLMs entails significant\nefforts and resources. While numerous model editing techniques (METs) have\nemerged to efficiently update model outputs without retraining, their\neffectiveness in multilingual LLMs, where knowledge is stored in diverse\nlanguages, remains an underexplored research area. This research paper\nintroduces the cross-lingual model editing (\\textbf{XME}) paradigm, wherein a\nfact is edited in one language, and the subsequent update propagation is\nobserved across other languages. To investigate the XME paradigm, we conducted\nexperiments using BLOOM, mBERT, and XLM-RoBERTa using the two writing scripts:\n\\textit{Latin} (English, French, and Spanish) and \\textit{Indic} (Hindi,\nGujarati, and Bengali). The results reveal notable performance limitations of\nstate-of-the-art METs under the XME setting, mainly when the languages involved\nbelong to two distinct script families. 
These findings highlight the need for\nfurther research and development of XME techniques to address these challenges.\nFor more comprehensive information, the dataset used in this research and the\nassociated code are publicly available at the following\nURL\\url{https://github.com/lingo-iitgn/XME}.\n","authors":["Himanshu Beniwal","Kowsik Nandagopan D","Mayank Singh"],"pdf_url":"https://arxiv.org/pdf/2401.10521v1.pdf","comment":"Accepted at EACL 2024"},{"id":"http://arxiv.org/abs/2312.15880v2","updated":"2024-01-19T06:42:16Z","published":"2023-12-26T04:22:56Z","title":"KnowledgeNavigator: Leveraging Large Language Models for Enhanced\n Reasoning over Knowledge Graph","summary":" Large language model (LLM) has achieved outstanding performance on various\ndownstream tasks with its powerful natural language understanding and zero-shot\ncapability, but LLM still suffers from knowledge limitation. Especially in\nscenarios that require long logical chains or complex reasoning, the\nhallucination and knowledge limitation of LLM limit its performance in question\nanswering (QA). In this paper, we propose a novel framework KnowledgeNavigator\nto address these challenges by efficiently and accurately retrieving external\nknowledge from knowledge graph and using it as a key factor to enhance LLM\nreasoning. Specifically, KnowledgeNavigator first mines and enhances the\npotential constraints of the given question to guide the reasoning. Then it\nretrieves and filters external knowledge that supports answering through\niterative reasoning on knowledge graph with the guidance of LLM and the\nquestion. Finally, KnowledgeNavigator constructs the structured knowledge into\neffective prompts that are friendly to LLM to help its reasoning. We evaluate\nKnowledgeNavigator on multiple public KGQA benchmarks, the experiments show the\nframework has great effectiveness and generalization, outperforming previous\nknowledge graph enhanced LLM methods and is comparable to the fully supervised\nmodels.\n","authors":["Tiezheng Guo","Qingwen Yang","Chen Wang","Yanyi Liu","Pan Li","Jiawei Tang","Dapeng Li","Yingyou Wen"],"pdf_url":"https://arxiv.org/pdf/2312.15880v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.05492v3","updated":"2024-01-19T06:06:46Z","published":"2023-10-09T07:56:16Z","title":"How Abilities in Large Language Models are Affected by Supervised\n Fine-tuning Data Composition","summary":" Large language models (LLMs) with enormous pre-training tokens and parameters\nemerge diverse abilities, including math reasoning, code generation, and\ninstruction following. These abilities are further enhanced by supervised\nfine-tuning (SFT). While the open-source community has explored ad-hoc SFT for\nenhancing individual capabilities, proprietary LLMs exhibit versatility across\nvarious skills. Therefore, understanding the facilitation of multiple abilities\nvia SFT is paramount. In this study, we specifically focuses on the interplay\nof data composition between mathematical reasoning, code generation, and\ngeneral human-aligning abilities during SFT. We propose four intriguing\nresearch questions to explore the association between model performance and\nvarious factors including data amount, composition ratio, model size and SFT\nstrategies. 
Our experiments reveal that distinct capabilities scale differently\nand larger models generally show superior performance with same amount of data.\nMathematical reasoning and code generation consistently improve with increasing\ndata amount, whereas general abilities plateau after roughly a thousand\nsamples. Moreover, we observe data composition appears to enhance various\nabilities under limited data conditions, yet can lead to performance conflicts\nwhen data is plentiful. Our findings also suggest the amount of composition\ndata influences performance more than the composition ratio. In analysis of SFT\nstrategies, we find that sequentially learning multiple skills risks\ncatastrophic forgetting. Our proposed Dual-stage Mixed Fine-tuning (DMT)\nstrategy offers a promising solution to learn multiple abilities with different\nscaling patterns.\n","authors":["Guanting Dong","Hongyi Yuan","Keming Lu","Chengpeng Li","Mingfeng Xue","Dayiheng Liu","Wei Wang","Zheng Yuan","Chang Zhou","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.05492v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10510v1","updated":"2024-01-19T05:58:30Z","published":"2024-01-19T05:58:30Z","title":"A match made in consistency heaven: when large language models meet\n evolutionary algorithms","summary":" Pre-trained large language models (LLMs) have powerful capabilities for\ngenerating creative natural text. Evolutionary algorithms (EAs) can discover\ndiverse solutions to complex real-world problems. Motivated by the common\ncollective and directionality of text sequence generation and evolution, this\npaper illustrates the strong consistency of LLMs and EAs, which includes\nmultiple one-to-one key characteristics: token embedding and genotype-phenotype\nmapping, position encoding and fitness shaping, position embedding and\nselection, attention and crossover, feed-forward neural network and mutation,\nmodel training and parameter update, and multi-task learning and\nmulti-objective optimization. Based on this consistency perspective, existing\ncoupling studies are analyzed, including evolutionary fine-tuning and\nLLM-enhanced EAs. Leveraging these insights, we outline a fundamental roadmap\nfor future research in coupling LLMs and EAs, while highlighting key challenges\nalong the way. The consistency not only reveals the evolution mechanism behind\nLLMs but also facilitates the development of evolved artificial agents that\napproach or surpass biological organisms.\n","authors":["Wang Chao","Jiaxuan Zhao","Licheng Jiao","Lingling Li","Fang Liu","Shuyuan Yang"],"pdf_url":"https://arxiv.org/pdf/2401.10510v1.pdf","comment":"A perspective article under review"},{"id":"http://arxiv.org/abs/2401.10506v1","updated":"2024-01-19T05:48:07Z","published":"2024-01-19T05:48:07Z","title":"FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial\n Analysis","summary":" Text-to-SQL, which provides zero-code interface for operating relational\ndatabases, has gained much attention in financial analysis; because, financial\nprofessionals may not well-skilled in SQL programming. However, until now,\nthere is no practical Text-to-SQL benchmark dataset for financial analysis, and\nexisting Text-to-SQL methods have not considered the unique characteristics of\ndatabases in financial applications, such as commonly existing wide tables. 
To\naddress these issues, we collect a practical Text-to-SQL benchmark dataset and\npropose a model-agnostic Large Language Model (LLMs)-based Text-to-SQL\nframework for financial analysis. The benchmark dataset, BULL, is collected\nfrom the practical financial analysis business of Hundsun Technologies Inc.,\nincluding databases for fund, stock, and macro economy. Besides, the proposed\nLLMs-based Text-to-SQL framework, FinSQL, provides a systematic treatment for\nfinancial Text-to-SQL from the perspectives of prompt construction,\nparameter-efficient fine-tuning and output calibration. Extensive experimental\nresults on BULL demonstrate that FinSQL achieves the state-of-the-art\nText-to-SQL performance at a small cost; furthermore, FinSQL can bring up to\n36.64% performance improvement in scenarios requiring few-shot cross-database\nmodel transfer.\n","authors":["Chao Zhang","Yuren Mao","Yijiang Fan","Yu Mi","Yunjun Gao","Lu Chen","Dongfang Lou","Jinshu Lin"],"pdf_url":"https://arxiv.org/pdf/2401.10506v1.pdf","comment":"13 pages, 13 figures"},{"id":"http://arxiv.org/abs/2401.00368v2","updated":"2024-01-19T05:16:20Z","published":"2023-12-31T02:13:18Z","title":"Improving Text Embeddings with Large Language Models","summary":" In this paper, we introduce a novel and simple method for obtaining\nhigh-quality text embeddings using only synthetic data and less than 1k\ntraining steps. Unlike existing methods that often depend on multi-stage\nintermediate pre-training with billions of weakly-supervised text pairs,\nfollowed by fine-tuning with a few labeled datasets, our method does not\nrequire building complex training pipelines or relying on manually collected\ndatasets that are often constrained by task diversity and language coverage. We\nleverage proprietary LLMs to generate diverse synthetic data for hundreds of\nthousands of text embedding tasks across nearly 100 languages. We then\nfine-tune open-source decoder-only LLMs on the synthetic data using standard\ncontrastive loss. Experiments demonstrate that our method achieves strong\nperformance on highly competitive text embedding benchmarks without using any\nlabeled data. Furthermore, when fine-tuned with a mixture of synthetic and\nlabeled data, our model sets new state-of-the-art results on the BEIR and MTEB\nbenchmarks.\n","authors":["Liang Wang","Nan Yang","Xiaolong Huang","Linjun Yang","Rangan Majumder","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2401.00368v2.pdf","comment":"20 pages, 15 tables"},{"id":"http://arxiv.org/abs/2401.10491v1","updated":"2024-01-19T05:02:46Z","published":"2024-01-19T05:02:46Z","title":"Knowledge Fusion of Large Language Models","summary":" While training large language models (LLMs) from scratch can generate models\nwith distinct functionalities and strengths, it comes at significant costs and\nmay result in redundant capabilities. Alternatively, a cost-effective and\ncompelling approach is to merge existing pre-trained LLMs into a more potent\nmodel. However, due to the varying architectures of these LLMs, directly\nblending their weights is impractical. In this paper, we introduce the notion\nof knowledge fusion for LLMs, aimed at combining the capabilities of existing\nLLMs and transferring them into a single LLM. By leveraging the generative\ndistributions of source LLMs, we externalize their collective knowledge and\nunique strengths, thereby potentially elevating the capabilities of the target\nmodel beyond those of any individual source LLM. 
We validate our approach using\nthree popular LLMs with different architectures--Llama-2, MPT, and\nOpenLLaMA--across various benchmarks and tasks. Our findings confirm that the\nfusion of LLMs can improve the performance of the target model across a range\nof capabilities such as reasoning, commonsense, and code generation. Our code,\nmodel weights, and data are public at\n\\url{https://github.com/fanqiwan/FuseLLM}.\n","authors":["Fanqi Wan","Xinting Huang","Deng Cai","Xiaojun Quan","Wei Bi","Shuming Shi"],"pdf_url":"https://arxiv.org/pdf/2401.10491v1.pdf","comment":"Accepted to ICLR 2024"},{"id":"http://arxiv.org/abs/2401.09972v2","updated":"2024-01-19T04:29:42Z","published":"2024-01-18T13:41:08Z","title":"Better Explain Transformers by Illuminating Important Information","summary":" Transformer-based models excel in various natural language processing (NLP)\ntasks, attracting countless efforts to explain their inner workings. Prior\nmethods explain Transformers by focusing on the raw gradient and attention as\ntoken attribution scores, where non-relevant information is often considered\nduring explanation computation, resulting in confusing results. In this work,\nwe propose highlighting the important information and eliminating irrelevant\ninformation by a refined information flow on top of the layer-wise relevance\npropagation (LRP) method. Specifically, we consider identifying syntactic and\npositional heads as important attention heads and focus on the relevance\nobtained from these important heads. Experimental results demonstrate that\nirrelevant information does distort output attribution scores and then should\nbe masked during explanation computation. Compared to eight baselines on both\nclassification and question-answering datasets, our method consistently\noutperforms with over 3\\% to 33\\% improvement on explanation metrics, providing\nsuperior explanation performance. Our anonymous code repository is available\nat: https://github.com/LinxinS97/Mask-LRP\n","authors":["Linxin Song","Yan Cui","Ao Luo","Freddy Lecue","Irene Li"],"pdf_url":"https://arxiv.org/pdf/2401.09972v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10487v1","updated":"2024-01-19T04:24:07Z","published":"2024-01-19T04:24:07Z","title":"Generative Dense Retrieval: Memory Can Be a Burden","summary":" Generative Retrieval (GR), autoregressively decoding relevant document\nidentifiers given a query, has been shown to perform well under the setting of\nsmall-scale corpora. By memorizing the document corpus with model parameters,\nGR implicitly achieves deep interaction between query and document. However,\nsuch a memorizing mechanism faces three drawbacks: (1) Poor memory accuracy for\nfine-grained features of documents; (2) Memory confusion gets worse as the\ncorpus size increases; (3) Huge memory update costs for new documents. To\nalleviate these problems, we propose the Generative Dense Retrieval (GDR)\nparadigm. Specifically, GDR first uses the limited memory volume to achieve\ninter-cluster matching from query to relevant document clusters.\nMemorizing-free matching mechanism from Dense Retrieval (DR) is then introduced\nto conduct fine-grained intra-cluster matching from clusters to relevant\ndocuments. The coarse-to-fine process maximizes the advantages of GR's deep\ninteraction and DR's scalability. 
Besides, we design a cluster identifier\nconstructing strategy to facilitate corpus memory and a cluster-adaptive\nnegative sampling strategy to enhance the intra-cluster mapping ability.\nEmpirical results show that GDR obtains an average of 3.0 R@100 improvement on\nNQ dataset under multiple settings and has better scalability.\n","authors":["Peiwen Yuan","Xinglin Wang","Shaoxiong Feng","Boyuan Pan","Yiwei Li","Heda Wang","Xupeng Miao","Kan Li"],"pdf_url":"https://arxiv.org/pdf/2401.10487v1.pdf","comment":"EACL 2024 main"},{"id":"http://arxiv.org/abs/2401.10480v1","updated":"2024-01-19T04:03:59Z","published":"2024-01-19T04:03:59Z","title":"Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step\n Reasoning","summary":" Self-consistency (SC) has been a widely used decoding strategy for\nchain-of-thought reasoning. Despite bringing significant performance\nimprovements across a variety of multi-step reasoning tasks, it is a high-cost\nmethod that requires multiple sampling with the preset size. In this paper, we\npropose a simple and scalable sampling process, \\textbf{E}arly-Stopping\n\\textbf{S}elf-\\textbf{C}onsistency (ESC), to greatly reduce the cost of SC\nwithout sacrificing performance. On this basis, one control scheme for ESC is\nfurther derivated to dynamically choose the performance-cost balance for\ndifferent tasks and models. To demonstrate ESC's effectiveness, we conducted\nextensive experiments on three popular categories of reasoning tasks:\narithmetic, commonsense and symbolic reasoning over language models with\nvarying scales. The empirical results show that ESC reduces the average number\nof sampling of chain-of-thought reasoning by a significant margin on six\nbenchmarks, including MATH (-33.8%), GSM8K (-80.1%), StrategyQA (-76.8%),\nCommonsenseQA (-78.5%), Coin Flip (-84.2%) and Last Letters (-67.4%), while\nattaining comparable performances.\n","authors":["Yiwei Li","Peiwen Yuan","Shaoxiong Feng","Boyuan Pan","Xinglin Wang","Bin Sun","Heda Wang","Kan Li"],"pdf_url":"https://arxiv.org/pdf/2401.10480v1.pdf","comment":"ICLR 2024"},{"id":"http://arxiv.org/abs/2401.10472v1","updated":"2024-01-19T03:49:28Z","published":"2024-01-19T03:49:28Z","title":"Name Tagging Under Domain Shift via Metric Learning for Life Sciences","summary":" Name tagging is a key component of Information Extraction (IE), particularly\nin scientific domains such as biomedicine and chemistry, where large language\nmodels (LLMs), e.g., ChatGPT, fall short. We investigate the applicability of\ntransfer learning for enhancing a name tagging model trained in the biomedical\ndomain (the source domain) to be used in the chemical domain (the target\ndomain). A common practice for training such a model in a few-shot learning\nsetting is to pretrain the model on the labeled source data, and then, to\nfinetune it on a hand-full of labeled target examples. In our experiments we\nobserved that such a model is prone to mis-labeling the source entities, which\ncan often appear in the text, as the target entities. To alleviate this\nproblem, we propose a model to transfer the knowledge from the source domain to\nthe target domain, however, at the same time, to project the source entities\nand target entities into separate regions of the feature space. This diminishes\nthe risk of mis-labeling the source entities as the target entities. 
Our model\nconsists of two stages: 1) entity grouping in the source domain, which\nincorporates knowledge from annotated events to establish relations between\nentities, and 2) entity discrimination in the target domain, which relies on\npseudo labeling and contrastive learning to enhance discrimination between the\nentities in the two domains. We carry out our extensive experiments across\nthree source and three target datasets, and demonstrate that our method\noutperforms the baselines, in some scenarios by 5\\% absolute value.\n","authors":["Hongyi Liu","Qingyun Wang","Payam Karisani","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2401.10472v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2401.10471v1","updated":"2024-01-19T03:48:27Z","published":"2024-01-19T03:48:27Z","title":"DeepEdit: Knowledge Editing as Decoding with Constraints","summary":" We develop a new perspective of knowledge editing for large language models\n(LLMs) as decoding with constraints. We propose DeepEdit (Depth-first Search\nbased Progressive Decoding for Knowledge Editing), a neuro-symbolic method that\nimproves knowledge editing with better coherence of reasoning, relevance to the\nquestion, and awareness of updated knowledge. DeepEdit can be flexibly applied\nto all black-box LLMs: it does not require any access to the model parameters,\nrepresentations, or output vocabulary distributions. DeepEdit progressively\nproduces the high-quality reasoning steps towards effective knowledge editing.\nIt utilizes a depth-first search to revise the LLMs' output, which improves the\noutput's informativeness to the input question and awareness of the updated\nknowledge. Qualitatively, DeepEdit effectively controls LLMs to produce more\nsuccinct reasoning in accord with knowledge editing. Quantitatively, DeepEdit\nyields significant gains on MQuaKE, a challenging multi-hop question-answering\ndataset with knowledge editing. We release the source code at\nhttps://github.com/wangywUST/DeepEdit.\n","authors":["Yiwei Wang","Muhao Chen","Nanyun Peng","Kai-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2401.10471v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10465v1","updated":"2024-01-19T03:37:27Z","published":"2024-01-19T03:37:27Z","title":"Data-driven grapheme-to-phoneme representations for a lexicon-free\n text-to-speech","summary":" Grapheme-to-Phoneme (G2P) is an essential first step in any modern,\nhigh-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely\non carefully hand-crafted lexicons developed by experts. This poses a two-fold\nproblem. Firstly, the lexicons are generated using a fixed phoneme set,\nusually, ARPABET or IPA, which might not be the most optimal way to represent\nphonemes for all languages. Secondly, the man-hours required to produce such an\nexpert lexicon are very high. In this paper, we eliminate both of these issues\nby using recent advances in self-supervised learning to obtain data-driven\nphoneme representations instead of fixed representations. We compare our\nlexicon-free approach against strong baselines that utilize a well-crafted\nlexicon. Furthermore, we show that our data-driven lexicon-free method performs\nas good or even marginally better than the conventional rule-based or\nlexicon-based neural G2Ps in terms of Mean Opinion Score (MOS) while using no\nprior language lexicon or phoneme set, i.e. 
no linguistic expertise.\n","authors":["Abhinav Garg","Jiyeon Kim","Sushil Khyalia","Chanwoo Kim","Dhananjaya Gowda"],"pdf_url":"https://arxiv.org/pdf/2401.10465v1.pdf","comment":"Accepted at ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.10463v1","updated":"2024-01-19T03:24:36Z","published":"2024-01-19T03:24:36Z","title":"Critical Data Size of Language Models from a Grokking Perspective","summary":" We explore the critical data size in language models, a threshold that marks\na fundamental shift from quick memorization to slow generalization. We\nformalize the phase transition under the grokking configuration into the Data\nEfficiency Hypothesis and identify data insufficiency, sufficiency, and surplus\nregimes in language models training dynamics. We develop a grokking\nconfiguration to reproduce grokking on simplistic language models stably by\nrescaling initialization and weight decay. We show that generalization occurs\nonly when language models reach a critical size. We analyze grokking across\nsample-wise and model-wise, verifying the proposed data efficiency hypothesis.\nOur experiments reveal smoother phase transitions occurring at the critical\ndataset size for language datasets. As the model size increases, this critical\npoint also becomes larger, indicating that larger models require more data. Our\nresults deepen the understanding of language model training, offering a novel\nperspective on the role of data in the learning mechanism of language models.\n","authors":["Xuekai Zhu","Yao Fu","Bowen Zhou","Zhouhan Lin"],"pdf_url":"https://arxiv.org/pdf/2401.10463v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03279v2","updated":"2024-01-19T02:26:38Z","published":"2023-08-07T03:39:52Z","title":"UniversalNER: Targeted Distillation from Large Language Models for Open\n Named Entity Recognition","summary":" Large language models (LLMs) have demonstrated remarkable generalizability,\nsuch as understanding arbitrary entities and relations. Instruction tuning has\nproven effective for distilling LLMs into more cost-efficient models such as\nAlpaca and Vicuna. Yet such student models still trail the original LLMs by\nlarge margins in downstream applications. In this paper, we explore targeted\ndistillation with mission-focused instruction tuning to train student models\nthat can excel in a broad application class such as open information\nextraction. Using named entity recognition (NER) for case study, we show how\nChatGPT can be distilled into much smaller UniversalNER models for open NER.\nFor evaluation, we assemble the largest NER benchmark to date, comprising 43\ndatasets across 9 diverse domains such as biomedicine, programming, social\nmedia, law, finance. Without using any direct supervision, UniversalNER attains\nremarkable NER accuracy across tens of thousands of entity types, outperforming\ngeneral instruction-tuned models such as Alpaca and Vicuna by over 30 absolute\nF1 points in average. With a tiny fraction of parameters, UniversalNER not only\nacquires ChatGPT's capability in recognizing arbitrary entity types, but also\noutperforms its NER accuracy by 7-9 absolute F1 points in average. Remarkably,\nUniversalNER even outperforms by a large margin state-of-the-art multi-task\ninstruction-tuned systems such as InstructUIE, which uses supervised NER\nexamples. We also conduct thorough ablation studies to assess the impact of\nvarious components in our distillation approach. 
We release the distillation\nrecipe, data, and UniversalNER models to facilitate future research on targeted\ndistillation.\n","authors":["Wenxuan Zhou","Sheng Zhang","Yu Gu","Muhao Chen","Hoifung Poon"],"pdf_url":"https://arxiv.org/pdf/2308.03279v2.pdf","comment":"Accepted at ICLR 2024. Project page: https://universal-ner.github.io/"},{"id":"http://arxiv.org/abs/2401.10449v1","updated":"2024-01-19T01:36:07Z","published":"2024-01-19T01:36:07Z","title":"Contextualized Automatic Speech Recognition with Attention-Based Bias\n Phrase Boosted Beam Search","summary":" End-to-end (E2E) automatic speech recognition (ASR) methods exhibit\nremarkable performance. However, since the performance of such methods is\nintrinsically linked to the context present in the training data, E2E-ASR\nmethods do not perform as desired for unseen user contexts (e.g., technical\nterms, personal names, and playlists). Thus, E2E-ASR methods must be easily\ncontextualized by the user or developer. This paper proposes an attention-based\ncontextual biasing method that can be customized using an editable phrase list\n(referred to as a bias list). The proposed method can be trained effectively by\ncombining a bias phrase index loss and special tokens to detect the bias\nphrases in the input speech data. In addition, to improve the contextualization\nperformance during inference further, we propose a bias phrase boosted (BPB)\nbeam search algorithm based on the bias phrase index probability. Experimental\nresults demonstrate that the proposed method consistently improves the word\nerror rate and the character error rate of the target phrases in the bias list\non both the Librispeech-960 (English) and our in-house (Japanese) dataset,\nrespectively.\n","authors":["Yui Sudo","Muhammad Shakeel","Yosuke Fukumoto","Yifan Peng","Shinji Watanabe"],"pdf_url":"https://arxiv.org/pdf/2401.10449v1.pdf","comment":"accepted by ICASSP20224"},{"id":"http://arxiv.org/abs/2401.10447v1","updated":"2024-01-19T01:30:16Z","published":"2024-01-19T01:30:16Z","title":"Investigating Training Strategies and Model Robustness of Low-Rank\n Adaptation for Language Modeling in Speech Recognition","summary":" The use of low-rank adaptation (LoRA) with frozen pretrained language models\n(PLMs) has become increasing popular as a mainstream, resource-efficient\nmodeling approach for memory-constrained hardware. In this study, we first\nexplore how to enhance model performance by introducing various LoRA training\nstrategies, achieving relative word error rate reductions of 3.50\\% on the\npublic Librispeech dataset and of 3.67\\% on an internal dataset in the\nmessaging domain. To further characterize the stability of LoRA-based\nsecond-pass speech recognition models, we examine robustness against input\nperturbations. These perturbations are rooted in homophone replacements and a\nnovel metric called N-best Perturbation-based Rescoring Robustness (NPRR), both\ndesigned to measure the relative degradation in the performance of rescoring\nmodels. 
Our experimental results indicate that while advanced variants of LoRA,\nsuch as dynamic rank-allocated LoRA, lead to performance degradation in\n$1$-best perturbation, they alleviate the degradation in $N$-best perturbation.\nThis finding is in comparison to fully-tuned models and vanilla LoRA tuning\nbaselines, suggesting that a comprehensive selection is needed when using\nLoRA-based adaptation for compute-cost savings and robust language modeling.\n","authors":["Yu Yu","Chao-Han Huck Yang","Tuan Dinh","Sungho Ryu","Jari Kolehmainen","Roger Ren","Denis Filimonov","Prashanth G. Shivakumar","Ankur Gandhe","Ariya Rastow","Jia Xu","Ivan Bulyko","Andreas Stolcke"],"pdf_url":"https://arxiv.org/pdf/2401.10447v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10446v1","updated":"2024-01-19T01:29:27Z","published":"2024-01-19T01:29:27Z","title":"Large Language Models are Efficient Learners of Noise-Robust Speech\n Recognition","summary":" Recent advances in large language models (LLMs) have promoted generative\nerror correction (GER) for automatic speech recognition (ASR), which leverages\nthe rich linguistic knowledge and powerful reasoning ability of LLMs to improve\nrecognition results. The latest work proposes a GER benchmark with HyPoradise\ndataset to learn the mapping from ASR N-best hypotheses to ground-truth\ntranscription by efficient LLM finetuning, which shows great effectiveness but\nlacks specificity on noise-robust ASR. In this work, we extend the benchmark to\nnoisy conditions and investigate if we can teach LLMs to perform denoising for\nGER just like what robust ASR do}, where one solution is introducing noise\ninformation as a conditioner into LLM. However, directly incorporating noise\nembeddings from audio encoder could harm the LLM tuning due to cross-modality\ngap. To this end, we propose to extract a language-space noise embedding from\nthe N-best list to represent the noise conditions of source speech, which can\npromote the denoising process in GER. Furthermore, in order to enhance its\nrepresentation ability of audio noise, we design a knowledge distillation (KD)\napproach via mutual information estimation to distill the real noise\ninformation in audio embeddings to our language embedding. Experiments on\nvarious latest LLMs demonstrate our approach achieves a new breakthrough with\nup to 53.9% correction improvement in terms of word error rate while with\nlimited training data. Analysis shows that our language-space noise embedding\ncan well represent the noise conditions of source speech, under which\noff-the-shelf LLMs show strong ability of language-space denoising.\n","authors":["Yuchen Hu","Chen Chen","Chao-Han Huck Yang","Ruizhe Li","Chao Zhang","Pin-Yu Chen","EnSiong Chng"],"pdf_url":"https://arxiv.org/pdf/2401.10446v1.pdf","comment":"Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be\n open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license"},{"id":"http://arxiv.org/abs/2401.10440v1","updated":"2024-01-19T01:07:50Z","published":"2024-01-19T01:07:50Z","title":"Breaking the Curse of Multilinguality with Cross-lingual Expert Language\n Models","summary":" Despite their popularity in non-English NLP, multilingual language models\noften underperform monolingual ones due to inter-language competition for model\nparameters. We propose Cross-lingual Expert Language Models (X-ELM), which\nmitigate this competition by independently training language models on subsets\nof the multilingual corpus. 
This process specializes X-ELMs to different\nlanguages while remaining effective as a multilingual ensemble. Our experiments\nshow that when given the same compute budget, X-ELM outperforms jointly trained\nmultilingual models across all considered languages and that these gains\ntransfer to downstream tasks. X-ELM provides additional benefits over\nperformance improvements: new experts can be iteratively added, adapting X-ELM\nto new languages without catastrophic forgetting. Furthermore, training is\nasynchronous, reducing the hardware requirements for multilingual training and\ndemocratizing multilingual modeling.\n","authors":["Terra Blevins","Tomasz Limisiewicz","Suchin Gururangan","Margaret Li","Hila Gonen","Noah A. Smith","Luke Zettlemoyer"],"pdf_url":"https://arxiv.org/pdf/2401.10440v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.04398v2","updated":"2024-01-19T01:05:05Z","published":"2024-01-09T07:46:26Z","title":"Chain-of-Table: Evolving Tables in the Reasoning Chain for Table\n Understanding","summary":" Table-based reasoning with large language models (LLMs) is a promising\ndirection to tackle many table understanding tasks, such as table-based\nquestion answering and fact verification. Compared with generic reasoning,\ntable-based reasoning requires the extraction of underlying semantics from both\nfree-form questions and semi-structured tabular data. Chain-of-Thought and its\nsimilar approaches incorporate the reasoning chain in the form of textual\ncontext, but it is still an open question how to effectively leverage tabular\ndata in the reasoning chain. We propose the Chain-of-Table framework, where\ntabular data is explicitly used in the reasoning chain as a proxy for\nintermediate thoughts. Specifically, we guide LLMs using in-context learning to\niteratively generate operations and update the table to represent a tabular\nreasoning chain. LLMs can therefore dynamically plan the next operation based\non the results of the previous ones. This continuous evolution of the table\nforms a chain, showing the reasoning process for a given tabular problem. The\nchain carries structured information of the intermediate results, enabling more\naccurate and reliable predictions. Chain-of-Table achieves new state-of-the-art\nperformance on WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM\nchoices.\n","authors":["Zilong Wang","Hao Zhang","Chun-Liang Li","Julian Martin Eisenschlos","Vincent Perot","Zifeng Wang","Lesly Miculicich","Yasuhisa Fujii","Jingbo Shang","Chen-Yu Lee","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2401.04398v2.pdf","comment":"Accepted to ICLR 2024"},{"id":"http://arxiv.org/abs/2401.11052v1","updated":"2024-01-19T23:00:31Z","published":"2024-01-19T23:00:31Z","title":"Mining experimental data from Materials Science literature with Large\n Language Models","summary":" This study is dedicated to evaluating the capabilities of advanced large\nlanguage models (LLMs) such as GPT-3.5-Turbo, GPT-4, and GPT-4-Turbo in the\nextraction of structured information from scientific documents within the field\nof materials science. We introduce a novel methodology for the comparative\nanalysis of intricate material expressions, emphasising the standardisation of\nchemical formulas to tackle the complexities inherent in materials science\ninformation assessment. 
To this end, we primarily focus on two critical tasks\nof information extraction: (i) a named entity recognition (NER) of studied\nmaterials and physical properties and (ii) a relation extraction (RE) between\nthese entities. The performance of LLMs in executing these tasks is benchmarked\nagainst traditional models based on the BERT architecture and rule-based\napproaches. For NER, LLMs fail to outperform the baseline with zero-shot\nprompting and exhibit only limited improvement with few-shot prompting.\nHowever, for RE, a GPT-3.5-Turbo fine-tuned with the appropriate strategy\noutperforms all models, including the baseline. Without any fine-tuning, GPT-4\nand GPT-4-Turbo display remarkable reasoning and relationship extraction\ncapabilities after being provided with merely a couple of examples, surpassing\nthe baseline. Overall, the results suggest that although LLMs demonstrate\nrelevant reasoning skills in connecting concepts, for tasks requiring\nextracting complex domain-specific entities like materials, specialised models\nare currently a better choice.\n","authors":["Luca Foppiano","Guillaume Lambard","Toshiyuki Amagasa","Masashi Ishii"],"pdf_url":"https://arxiv.org/pdf/2401.11052v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11048v1","updated":"2024-01-19T22:24:39Z","published":"2024-01-19T22:24:39Z","title":"PubTator 3.0: an AI-powered Literature Resource for Unlocking Biomedical\n Knowledge","summary":" PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a\nbiomedical literature resource using state-of-the-art AI techniques to offer\nsemantic and relation searches for key concepts like proteins, genetic\nvariants, diseases, and chemicals. It currently provides over one billion\nentity and relation annotations across approximately 36 million PubMed\nabstracts and 6 million full-text articles from the PMC open access subset,\nupdated weekly. PubTator 3.0's online interface and API utilize these\nprecomputed entity relations and synonyms to provide advanced search\ncapabilities and enable large-scale analyses, streamlining many complex\ninformation needs. We showcase the retrieval quality of PubTator 3.0 using a\nseries of entity pair queries, demonstrating that PubTator 3.0 retrieves a\ngreater number of articles than either PubMed or Google Scholar, with higher\nprecision in the top 20 results. We further show that integrating ChatGPT\n(GPT-4) with PubTator APIs dramatically improves the factuality and\nverifiability of its responses. In summary, PubTator 3.0 offers a comprehensive\nset of features and tools that allow researchers to navigate the ever-expanding\nwealth of biomedical literature, expediting research and unlocking valuable\ninsights for scientific discovery.\n","authors":["Chih-Hsuan Wei","Alexis Allot","Po-Ting Lai","Robert Leaman","Shubo Tian","Ling Luo","Qiao Jin","Zhizheng Wang","Qingyu Chen","Zhiyong Lu"],"pdf_url":"https://arxiv.org/pdf/2401.11048v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11033v1","updated":"2024-01-19T21:21:02Z","published":"2024-01-19T21:21:02Z","title":"FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for\n Large Language Models' Training?","summary":" Advancements in Large Language Models (LLMs) highlight the need for ethical\npractices and data integrity. We introduce a framework that embeds FAIR\n(Findable, Accessible, Interoperable, Reusable) data principles into LLM\ntraining. This approach marks a shift towards practices compliant with FAIR\nstandards. 
Our framework presents guidelines for integrating FAIR data\nprinciples into LLM training. This initiative includes a checklist for\nresearchers and developers. We also demonstrate its practical application\nthrough a case study focused on bias identification and mitigation in our\nFAIR-compliant dataset. This work is a significant contribution to AI ethics\nand data science, advocating for balanced and ethical training methods in LLMs.\n","authors":["Shaina Raza","Shardul Ghuge","Chen Ding","Deval Pandya"],"pdf_url":"https://arxiv.org/pdf/2401.11033v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11021v1","updated":"2024-01-19T20:40:23Z","published":"2024-01-19T20:40:23Z","title":"Analysis and Detection of Multilingual Hate Speech Using Transformer\n Based Deep Learning","summary":" Hate speech is harmful content that directly attacks or promotes hatred\nagainst members of groups or individuals based on actual or perceived aspects\nof identity, such as racism, religion, or sexual orientation. This can affect\nsocial life on social media platforms as hateful content shared through social\nmedia can harm both individuals and communities. As the prevalence of hate\nspeech increases online, the demand for automated detection as an NLP task is\nincreasing. In this work, the proposed method is using transformer-based model\nto detect hate speech in social media, like twitter, Facebook, WhatsApp,\nInstagram, etc. The proposed model is independent of languages and has been\ntested on Italian, English, German, Bengali. The Gold standard datasets were\ncollected from renowned researcher Zeerak Talat, Sara Tonelli, Melanie Siegel,\nand Rezaul Karim. The success rate of the proposed model for hate speech\ndetection is higher than the existing baseline and state-of-the-art models with\naccuracy in Bengali dataset is 89%, in English: 91%, in German dataset 91% and\nin Italian dataset it is 77%. The proposed algorithm shows substantial\nimprovement to the benchmark method.\n","authors":["Arijit Das","Somashree Nandy","Rupam Saha","Srijan Das","Diganta Saha"],"pdf_url":"https://arxiv.org/pdf/2401.11021v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2401.10995v1","updated":"2024-01-19T19:23:37Z","published":"2024-01-19T19:23:37Z","title":"The Radiation Oncology NLP Database","summary":" We present the Radiation Oncology NLP Database (ROND), the first dedicated\nNatural Language Processing (NLP) dataset for radiation oncology, an important\nmedical specialty that has received limited attention from the NLP community in\nthe past. With the advent of Artificial General Intelligence (AGI), there is an\nincreasing need for specialized datasets and benchmarks to facilitate research\nand development. ROND is specifically designed to address this gap in the\ndomain of radiation oncology, a field that offers many opportunities for NLP\nexploration. It encompasses various NLP tasks including Logic Reasoning, Text\nClassification, Named Entity Recognition (NER), Question Answering (QA), Text\nSummarization, and Patient-Clinician Conversations, each with a distinct focus\non radiation oncology concepts and application cases. In addition, we have\ndeveloped an instruction-tuning dataset consisting of over 20k instruction\npairs (based on ROND) and trained a large language model, CancerChat. This\nserves to demonstrate the potential of instruction-tuning large language models\nwithin a highly-specialized medical domain. 
The evaluation results in this\nstudy could serve as baseline results for future research. ROND aims to\nstimulate advancements in radiation oncology and clinical NLP by offering a\nplatform for testing and improving algorithms and models in a domain-specific\ncontext. The ROND dataset is a joint effort of multiple U.S. health\ninstitutions. The data is available at\nhttps://github.com/zl-liu/Radiation-Oncology-NLP-Database.\n","authors":["Zhengliang Liu","Jason Holmes","Wenxiong Liao","Chenbin Liu","Lian Zhang","Hongying Feng","Peilong Wang","Muhammad Ali Elahi","Hongmin Cai","Lichao Sun","Quanzheng Li","Xiang Li","Tianming Liu","Jiajian Shen","Wei Liu"],"pdf_url":"https://arxiv.org/pdf/2401.10995v1.pdf","comment":"10 pages, 7 figures, 6 tables"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2401.10891v1","updated":"2024-01-19T18:59:52Z","published":"2024-01-19T18:59:52Z","title":"Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data","summary":" This work presents Depth Anything, a highly practical solution for robust\nmonocular depth estimation. Without pursuing novel technical modules, we aim to\nbuild a simple yet powerful foundation model dealing with any images under any\ncircumstances. To this end, we scale up the dataset by designing a data engine\nto collect and automatically annotate large-scale unlabeled data (~62M), which\nsignificantly enlarges the data coverage and thus is able to reduce the\ngeneralization error. We investigate two simple yet effective strategies that\nmake data scaling-up promising. First, a more challenging optimization target\nis created by leveraging data augmentation tools. It compels the model to\nactively seek extra visual knowledge and acquire robust representations.\nSecond, an auxiliary supervision is developed to enforce the model to inherit\nrich semantic priors from pre-trained encoders. We evaluate its zero-shot\ncapabilities extensively, including six public datasets and randomly captured\nphotos. It demonstrates impressive generalization ability. Further, through\nfine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs\nare set. Our better depth model also results in a better depth-conditioned\nControlNet. Our models are released at\nhttps://github.com/LiheYoung/Depth-Anything.\n","authors":["Lihe Yang","Bingyi Kang","Zilong Huang","Xiaogang Xu","Jiashi Feng","Hengshuang Zhao"],"pdf_url":"https://arxiv.org/pdf/2401.10891v1.pdf","comment":"Project page: https://depth-anything.github.io"},{"id":"http://arxiv.org/abs/2401.10890v1","updated":"2024-01-19T18:59:37Z","published":"2024-01-19T18:59:37Z","title":"Event detection from novel data sources: Leveraging satellite imagery\n alongside GPS traces","summary":" Rapid identification and response to breaking events, particularly those that\npose a threat to human life such as natural disasters or conflicts, is of\nparamount importance. The prevalence of mobile devices and the ubiquity of\nnetwork connectivity has generated a massive amount of temporally- and\nspatially-stamped data. Numerous studies have used mobile data to derive\nindividual human mobility patterns for various applications. Similarly, the\nincreasing number of orbital satellites has made it easier to gather\nhigh-resolution images capturing a snapshot of a geographical area in sub-daily\ntemporal frequency. 
We propose a novel data fusion methodology integrating\nsatellite imagery with privacy-enhanced mobile data to augment the event\ninference task, whether in real-time or historical. In the absence of boots on\nthe ground, mobile data is able to give an approximation of human mobility,\nproximity to one another, and the built environment. On the other hand,\nsatellite imagery can provide visual information on physical changes to the\nbuilt and natural environment. The expected use cases for our methodology\ninclude small-scale disaster detection (i.e., tornadoes, wildfires, and floods)\nin rural regions, search and rescue operation augmentation for lost hikers in\nremote wilderness areas, and identification of active conflict areas and\npopulation displacement in war-torn states. Our implementation is open-source\non GitHub: https://github.com/ekinugurel/SatMobFusion.\n","authors":["Ekin Ugurel","Steffen Coenen","Minda Zhou Chen","Cynthia Chen"],"pdf_url":"https://arxiv.org/pdf/2401.10890v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10889v1","updated":"2024-01-19T18:59:11Z","published":"2024-01-19T18:59:11Z","title":"Synthesizing Moving People with 3D Control","summary":" In this paper, we present a diffusion model-based framework for animating\npeople from a single image for a given target 3D motion sequence. Our approach\nhas two core components: a) learning priors about invisible parts of the human\nbody and clothing, and b) rendering novel body poses with proper clothing and\ntexture. For the first part, we learn an in-filling diffusion model to\nhallucinate unseen parts of a person given a single image. We train this model\non texture map space, which makes it more sample-efficient since it is\ninvariant to pose and viewpoint. Second, we develop a diffusion-based rendering\npipeline, which is controlled by 3D human poses. This produces realistic\nrenderings of novel poses of the person, including clothing, hair, and\nplausible in-filling of unseen regions. This disentangled approach allows our\nmethod to generate a sequence of images that are faithful to the target motion\nin the 3D pose and, to the input image in terms of visual similarity. In\naddition to that, the 3D control allows various synthetic camera trajectories\nto render a person. Our experiments show that our method is resilient in\ngenerating prolonged motions and varied challenging and complex poses compared\nto prior methods. Please check our website for more details:\nhttps://boyiliee.github.io/3DHM.github.io/.\n","authors":["Boyi Li","Jathushan Rajasegaran","Yossi Gandelsman","Alexei A. Efros","Jitendra Malik"],"pdf_url":"https://arxiv.org/pdf/2401.10889v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10886v1","updated":"2024-01-19T18:57:46Z","published":"2024-01-19T18:57:46Z","title":"SCENES: Subpixel Correspondence Estimation With Epipolar Supervision","summary":" Extracting point correspondences from two or more views of a scene is a\nfundamental computer vision problem with particular importance for relative\ncamera pose estimation and structure-from-motion. Existing local feature\nmatching approaches, trained with correspondence supervision on large-scale\ndatasets, obtain highly-accurate matches on the test sets. However, they do not\ngeneralise well to new datasets with different characteristics to those they\nwere trained on, unlike classic feature extractors. 
Instead, they require\nfinetuning, which assumes that ground-truth correspondences or ground-truth\ncamera poses and 3D structure are available. We relax this assumption by\nremoving the requirement of 3D structure, e.g., depth maps or point clouds, and\nonly require camera pose information, which can be obtained from odometry. We\ndo so by replacing correspondence losses with epipolar losses, which encourage\nputative matches to lie on the associated epipolar line. While weaker than\ncorrespondence supervision, we observe that this cue is sufficient for\nfinetuning existing models on new data. We then further relax the assumption of\nknown camera poses by using pose estimates in a novel bootstrapping approach.\nWe evaluate on highly challenging datasets, including an indoor drone dataset\nand an outdoor smartphone camera dataset, and obtain state-of-the-art results\nwithout strong supervision.\n","authors":["Dominik A. Kloepfer","João F. Henriques","Dylan Campbell"],"pdf_url":"https://arxiv.org/pdf/2401.10886v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.20685v2","updated":"2024-01-19T18:53:13Z","published":"2023-10-31T17:49:48Z","title":"NeRF Revisited: Fixing Quadrature Instability in Volume Rendering","summary":" Neural radiance fields (NeRF) rely on volume rendering to synthesize novel\nviews. Volume rendering requires evaluating an integral along each ray, which\nis numerically approximated with a finite sum that corresponds to the exact\nintegral along the ray under piecewise constant volume density. As a\nconsequence, the rendered result is unstable w.r.t. the choice of samples along\nthe ray, a phenomenon that we dub quadrature instability. We propose a\nmathematically principled solution by reformulating the sample-based rendering\nequation so that it corresponds to the exact integral under piecewise linear\nvolume density. This simultaneously resolves multiple issues: conflicts between\nsamples along different rays, imprecise hierarchical sampling, and\nnon-differentiability of quantiles of ray termination distances w.r.t. model\nparameters. We demonstrate several benefits over the classical sample-based\nrendering equation, such as sharper textures, better geometric reconstruction,\nand stronger depth supervision. Our proposed formulation can be also be used as\na drop-in replacement to the volume rendering equation of existing NeRF-based\nmethods. Our project page can be found at pl-nerf.github.io.\n","authors":["Mikaela Angelina Uy","Kiyohiro Nakayama","Guandao Yang","Rahul Krishna Thomas","Leonidas Guibas","Ke Li"],"pdf_url":"https://arxiv.org/pdf/2310.20685v2.pdf","comment":"Neurips 2023"},{"id":"http://arxiv.org/abs/2401.10877v1","updated":"2024-01-19T18:41:53Z","published":"2024-01-19T18:41:53Z","title":"The Cadaver in the Machine: The Social Practices of Measurement and\n Validation in Motion Capture Technology","summary":" Motion capture systems, used across various domains, make body\nrepresentations concrete through technical processes. We argue that the\nmeasurement of bodies and the validation of measurements for motion capture\nsystems can be understood as social practices. By analyzing the findings of a\nsystematic literature review (N=278) through the lens of social practice\ntheory, we show how these practices, and their varying attention to errors,\nbecome ingrained in motion capture design and innovation over time. Moreover,\nwe show how contemporary motion capture systems perpetuate assumptions about\nhuman bodies and their movements. 
We suggest that social practices of\nmeasurement and validation are ubiquitous in the development of data- and\nsensor-driven systems more broadly, and provide this work as a basis for\ninvestigating hidden design assumptions and their potential negative\nconsequences in human-computer interaction.\n","authors":["Emma Harvey","Hauke Sandhaus","Abigail Z. Jacobs","Emanuel Moss","Mona Sloane"],"pdf_url":"https://arxiv.org/pdf/2401.10877v1.pdf","comment":"34 pages, 9 figures. To appear in the 2024 ACM CHI Conference on\n Human Factors in Computing Systems (CHI '24)"},{"id":"http://arxiv.org/abs/2306.08251v2","updated":"2024-01-19T18:35:54Z","published":"2023-06-14T05:34:02Z","title":"GBSD: Generative Bokeh with Stage Diffusion","summary":" The bokeh effect is an artistic technique that blurs out-of-focus areas in a\nphotograph and has gained interest due to recent developments in text-to-image\nsynthesis and the ubiquity of smart-phone cameras and photo-sharing apps. Prior\nwork on rendering bokeh effects have focused on post hoc image manipulation to\nproduce similar blurring effects in existing photographs using classical\ncomputer graphics or neural rendering techniques, but have either depth\ndiscontinuity artifacts or are restricted to reproducing bokeh effects that are\npresent in the training data. More recent diffusion based models can synthesize\nimages with an artistic style, but either require the generation of\nhigh-dimensional masks, expensive fine-tuning, or affect global image\ncharacteristics. In this paper, we present GBSD, the first generative\ntext-to-image model that synthesizes photorealistic images with a bokeh style.\nMotivated by how image synthesis occurs progressively in diffusion models, our\napproach combines latent diffusion models with a 2-stage conditioning algorithm\nto render bokeh effects on semantically defined objects. Since we can focus the\neffect on objects, this semantic bokeh effect is more versatile than classical\nrendering techniques. We evaluate GBSD both quantitatively and qualitatively\nand demonstrate its ability to be applied in both text-to-image and\nimage-to-image settings.\n","authors":["Jieren Deng","Xin Zhou","Hao Tian","Zhihong Pan","Derek Aguiar"],"pdf_url":"https://arxiv.org/pdf/2306.08251v2.pdf","comment":"Short Version is accepted by International Conference on Acoustics,\n Speech, and Signal Processing (ICASSP) 2024"},{"id":"http://arxiv.org/abs/2303.05015v2","updated":"2024-01-19T18:23:19Z","published":"2023-03-09T03:33:56Z","title":"Smooth and Stepwise Self-Distillation for Object Detection","summary":" Distilling the structured information captured in feature maps has\ncontributed to improved results for object detection tasks, but requires\ncareful selection of baseline architectures and substantial pre-training.\nSelf-distillation addresses these limitations and has recently achieved\nstate-of-the-art performance for object detection despite making several\nsimplifying architectural assumptions. Building on this work, we propose Smooth\nand Stepwise Self-Distillation (SSSD) for object detection. Our SSSD\narchitecture forms an implicit teacher from object labels and a feature pyramid\nnetwork backbone to distill label-annotated feature maps using Jensen-Shannon\ndistance, which is smoother than distillation losses used in prior work. We\nadditionally add a distillation coefficient that is adaptively configured based\non the learning rate. 
We extensively benchmark SSSD against a baseline and two\nstate-of-the-art object detector architectures on the COCO dataset by varying\nthe coefficients and backbone and detector networks. We demonstrate that SSSD\nachieves higher average precision in most experimental settings, is robust to a\nwide range of coefficients, and benefits from our stepwise distillation\nprocedure.\n","authors":["Jieren Deng","Xin Zhou","Hao Tian","Zhihong Pan","Derek Aguiar"],"pdf_url":"https://arxiv.org/pdf/2303.05015v2.pdf","comment":"Accepted by International Conference on Image Processing (ICIP) 2023"},{"id":"http://arxiv.org/abs/2401.10857v1","updated":"2024-01-19T18:00:52Z","published":"2024-01-19T18:00:52Z","title":"Motion Consistency Loss for Monocular Visual Odometry with\n Attention-Based Deep Learning","summary":" Deep learning algorithms have driven expressive progress in many complex\ntasks. The loss function is a core component of deep learning techniques,\nguiding the learning process of neural networks. This paper contributes by\nintroducing a consistency loss for visual odometry with deep learning-based\napproaches. The motion consistency loss explores repeated motions that appear\nin consecutive overlapped video clips. Experimental results show that our\napproach increased the performance of a model on the KITTI odometry benchmark.\n","authors":["André O. Françani","Marcos R. O. A. Maximo"],"pdf_url":"https://arxiv.org/pdf/2401.10857v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10848v1","updated":"2024-01-19T17:48:05Z","published":"2024-01-19T17:48:05Z","title":"Source-Free and Image-Only Unsupervised Domain Adaptation for Category\n Level Object Pose Estimation","summary":" We consider the problem of source-free unsupervised category-level pose\nestimation from only RGB images to a target domain without any access to source\ndomain data or 3D annotations during adaptation. Collecting and annotating\nreal-world 3D data and corresponding images is laborious, expensive, yet\nunavoidable process, since even 3D pose domain adaptation methods require 3D\ndata in the target domain. We introduce 3DUDA, a method capable of adapting to\na nuisance-ridden target domain without 3D or depth data. Our key insight stems\nfrom the observation that specific object subparts remain stable across\nout-of-domain (OOD) scenarios, enabling strategic utilization of these\ninvariant subcomponents for effective model updates. We represent object\ncategories as simple cuboid meshes, and harness a generative model of neural\nfeature activations modeled at each mesh vertex learnt using differential\nrendering. We focus on individual locally robust mesh vertex features and\niteratively update them based on their proximity to corresponding features in\nthe target domain even when the global pose is not correct. Our model is then\ntrained in an EM fashion, alternating between updating the vertex features and\nthe feature extractor. We show that our method simulates fine-tuning on a\nglobal pseudo-labeled dataset under mild assumptions, which converges to the\ntarget domain asymptotically. 
Through extensive empirical validation, including\na complex extreme UDA setup which combines real nuisances, synthetic noise, and\nocclusion, we demonstrate the potency of our simple approach in addressing the\ndomain shift challenge and significantly improving pose estimation accuracy.\n","authors":["Prakhar Kaushik","Aayush Mishra","Adam Kortylewski","Alan Yuille"],"pdf_url":"https://arxiv.org/pdf/2401.10848v1.pdf","comment":"36 pages, 9 figures, 50 tables; ICLR 2024 (Poster)"},{"id":"http://arxiv.org/abs/2401.10831v1","updated":"2024-01-19T17:27:21Z","published":"2024-01-19T17:27:21Z","title":"Understanding Video Transformers via Universal Concept Discovery","summary":" This paper studies the problem of concept-based interpretability of\ntransformer representations for videos. Concretely, we seek to explain the\ndecision-making process of video transformers based on high-level,\nspatiotemporal concepts that are automatically discovered. Prior research on\nconcept-based interpretability has concentrated solely on image-level tasks.\nComparatively, video models deal with the added temporal dimension, increasing\ncomplexity and posing challenges in identifying dynamic concepts over time. In\nthis work, we systematically address these challenges by introducing the first\nVideo Transformer Concept Discovery (VTCD) algorithm. To this end, we propose\nan efficient approach for unsupervised identification of units of video\ntransformer representations - concepts, and ranking their importance to the\noutput of a model. The resulting concepts are highly interpretable, revealing\nspatio-temporal reasoning mechanisms and object-centric representations in\nunstructured video models. Performing this analysis jointly over a diverse set\nof supervised and self-supervised representations, we discover that some of\nthese mechanisms are universal in video transformers. Finally, we demonstrate\nthat VTCD can be used to improve model performance for fine-grained tasks.\n","authors":["Matthew Kowal","Achal Dave","Rares Ambrus","Adrien Gaidon","Konstantinos G. Derpanis","Pavel Tokmakov"],"pdf_url":"https://arxiv.org/pdf/2401.10831v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10822v1","updated":"2024-01-19T17:16:16Z","published":"2024-01-19T17:16:16Z","title":"ActAnywhere: Subject-Aware Video Background Generation","summary":" Generating video background that tailors to foreground subject motion is an\nimportant problem for the movie industry and visual effects community. This\ntask involves synthesizing background that aligns with the motion and\nappearance of the foreground subject, while also complying with the artist's\ncreative intention. We introduce ActAnywhere, a generative model that automates\nthis process which traditionally requires tedious manual efforts. Our model\nleverages the power of large-scale video diffusion models, and is specifically\ntailored for this task. ActAnywhere takes a sequence of foreground subject\nsegmentation as input and an image that describes the desired scene as\ncondition, to produce a coherent video with realistic foreground-background\ninteractions while adhering to the condition frame. We train our model on a\nlarge-scale dataset of human-scene interaction videos. Extensive evaluations\ndemonstrate the superior performance of our model, significantly outperforming\nbaselines. Moreover, we show that ActAnywhere generalizes to diverse\nout-of-distribution samples, including non-human subjects. 
Please visit our\nproject webpage at https://actanywhere.github.io.\n","authors":["Boxiao Pan","Zhan Xu","Chun-Hao Paul Huang","Krishna Kumar Singh","Yang Zhou","Leonidas J. Guibas","Jimei Yang"],"pdf_url":"https://arxiv.org/pdf/2401.10822v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10815v1","updated":"2024-01-19T17:02:17Z","published":"2024-01-19T17:02:17Z","title":"RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text\n Supervision","summary":" Language-supervised pre-training has proven to be a valuable method for\nextracting semantically meaningful features from images, serving as a\nfoundational element in multimodal systems within the computer vision and\nmedical imaging domains. However, resulting features are limited by the\ninformation contained within the text. This is particularly problematic in\nmedical imaging, where radiologists' written findings focus on specific\nobservations; a challenge compounded by the scarcity of paired imaging-text\ndata due to concerns over leakage of personal health information. In this work,\nwe fundamentally challenge the prevailing reliance on language supervision for\nlearning general purpose biomedical imaging encoders. We introduce RAD-DINO, a\nbiomedical image encoder pre-trained solely on unimodal biomedical imaging data\nthat obtains similar or greater performance than state-of-the-art biomedical\nlanguage supervised models on a diverse range of benchmarks. Specifically, the\nquality of learned representations is evaluated on standard imaging tasks\n(classification and semantic segmentation), and a vision-language alignment\ntask (text report generation from images). To further demonstrate the drawback\nof language supervision, we show that features from RAD-DINO correlate with\nother medical records (e.g., sex or age) better than language-supervised\nmodels, which are generally not mentioned in radiology reports. Finally, we\nconduct a series of ablations determining the factors in RAD-DINO's\nperformance; notably, we observe that RAD-DINO's downstream performance scales\nwell with the quantity and diversity of training data, demonstrating that\nimage-only supervision is a scalable approach for training a foundational\nbiomedical image encoder.\n","authors":["Fernando Pérez-García","Harshita Sharma","Sam Bond-Taylor","Kenza Bouzid","Valentina Salvatelli","Maximilian Ilse","Shruthi Bannur","Daniel C. Castro","Anton Schwaighofer","Matthew P. Lungren","Maria Wetscherek","Noel Codella","Stephanie L. Hyland","Javier Alvarez-Valle","Ozan Oktay"],"pdf_url":"https://arxiv.org/pdf/2401.10815v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10805v1","updated":"2024-01-19T16:48:49Z","published":"2024-01-19T16:48:49Z","title":"Learning to Visually Connect Actions and their Effects","summary":" In this work, we introduce the novel concept of visually Connecting Actions\nand Their Effects (CATE) in video understanding. CATE can have applications in\nareas like task planning and learning from demonstration. We propose different\nCATE-based task formulations, such as action selection and action\nspecification, where video understanding models connect actions and effects at\nsemantic and fine-grained levels. We observe that different formulations\nproduce representations capturing intuitive action properties. We also design\nvarious baseline models for action selection and action specification. Despite\nthe intuitive nature of the task, we observe that models struggle, and humans\noutperform them by a large margin. 
The study aims to establish a foundation for\nfuture efforts, showcasing the flexibility and versatility of connecting\nactions and effects in video understanding, with the hope of inspiring advanced\nformulations and models.\n","authors":["Eric Peh","Paritosh Parmar","Basura Fernando"],"pdf_url":"https://arxiv.org/pdf/2401.10805v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10790v1","updated":"2024-01-19T16:21:55Z","published":"2024-01-19T16:21:55Z","title":"Measuring the Impact of Scene Level Objects on Object Detection: Towards\n Quantitative Explanations of Detection Decisions","summary":" Although accuracy and other common metrics can provide a useful window into\nthe performance of an object detection model, they lack a deeper view of the\nmodel's decision process. Regardless of the quality of the training data and\nprocess, the features that an object detection model learns cannot be\nguaranteed. A model may learn a relationship between certain background\ncontext, i.e., scene level objects, and the presence of the labeled classes.\nFurthermore, standard performance verification and metrics would not identify\nthis phenomenon. This paper presents a new black box explainability method for\nadditional verification of object detection models by finding the impact of\nscene level objects on the identification of the objects within the image. By\ncomparing the accuracies of a model on test data with and without certain scene\nlevel objects, the contributions of these objects to the model's performance\nbecomes clearer. The experiment presented here will assess the impact of\nbuildings and people in image context on the detection of emergency road\nvehicles by a fine-tuned YOLOv8 model. A large increase in accuracy in the\npresence of a scene level object will indicate the model's reliance on that\nobject to make its detections. The results of this research lead to providing a\nquantitative explanation of the object detection model's decision process,\nenabling a deeper understanding of the model's performance.\n","authors":["Lynn Vonder Haar","Timothy Elvira","Luke Newcomb","Omar Ochoa"],"pdf_url":"https://arxiv.org/pdf/2401.10790v1.pdf","comment":"9 pages, 4 figures, 1 table"},{"id":"http://arxiv.org/abs/2401.10786v1","updated":"2024-01-19T16:15:37Z","published":"2024-01-19T16:15:37Z","title":"Sat2Scene: 3D Urban Scene Generation from Satellite Images with\n Diffusion","summary":" Directly generating scenes from satellite imagery offers exciting\npossibilities for integration into applications like games and map services.\nHowever, challenges arise from significant view changes and scene scale.\nPrevious efforts mainly focused on image or video generation, lacking\nexploration into the adaptability of scene generation for arbitrary views.\nExisting 3D generation works either operate at the object level or are\ndifficult to utilize the geometry obtained from satellite imagery. To overcome\nthese limitations, we propose a novel architecture for direct 3D scene\ngeneration by introducing diffusion models into 3D sparse representations and\ncombining them with neural rendering techniques. Specifically, our approach\ngenerates texture colors at the point level for a given geometry using a 3D\ndiffusion model first, which is then transformed into a scene representation in\na feed-forward manner. The representation can be utilized to render arbitrary\nviews which would excel in both single-frame quality and inter-frame\nconsistency. 
Experiments in two city-scale datasets show that our model\ndemonstrates proficiency in generating photo-realistic street-view image\nsequences and cross-view urban scenes from satellite imagery.\n","authors":["Zuoyue Li","Zhenqiang Li","Zhaopeng Cui","Marc Pollefeys","Martin R. Oswald"],"pdf_url":"https://arxiv.org/pdf/2401.10786v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2401.09495v2","updated":"2024-01-19T16:11:28Z","published":"2024-01-17T01:33:40Z","title":"IPR-NeRF: Ownership Verification meets Neural Radiance Field","summary":" Neural Radiance Field (NeRF) models have gained significant attention in the\ncomputer vision community in the recent past with state-of-the-art visual\nquality and produced impressive demonstrations. Since then, technopreneurs have\nsought to leverage NeRF models into a profitable business. Therefore, NeRF\nmodels make it worth the risk of plagiarizers illegally copying,\nre-distributing, or misusing those models. This paper proposes a comprehensive\nintellectual property (IP) protection framework for the NeRF model in both\nblack-box and white-box settings, namely IPR-NeRF. In the black-box setting, a\ndiffusion-based solution is introduced to embed and extract the watermark via a\ntwo-stage optimization process. In the white-box setting, a designated digital\nsignature is embedded into the weights of the NeRF model by adopting the sign\nloss objective. Our extensive experiments demonstrate that not only does our\napproach maintain the fidelity (\\ie, the rendering quality) of IPR-NeRF models,\nbut it is also robust against both ambiguity and removal attacks compared to\nprior arts.\n","authors":["Win Kent Ong","Kam Woh Ng","Chee Seng Chan","Yi Zhe Song","Tao Xiang"],"pdf_url":"https://arxiv.org/pdf/2401.09495v2.pdf","comment":"Error on the paper"},{"id":"http://arxiv.org/abs/2401.10777v1","updated":"2024-01-19T15:51:34Z","published":"2024-01-19T15:51:34Z","title":"Determination of efficiency indicators of the stand for intelligent\n control of manual operations in industrial production","summary":" Systems of intelligent control of manual operations in industrial production\nare being implemented in many industries nowadays. Such systems use\nhigh-resolution cameras and computer vision algorithms to automatically track\nthe operator's manipulations and prevent technological errors in the assembly\nprocess. At the same time compliance with safety regulations in the workspace\nis monitored. As a result, the defect rate of manufactured products and the\nnumber of accidents during the manual assembly of any device are decreased.\nBefore implementing an intelligent control system into a real production it is\nnecessary to calculate its efficiency. In order to do it experiments on the\nstand for manual operations control systems were carried out. This paper\nproposes the methodology for calculating the efficiency indicators. This\nmathematical approach is based on the IoU calculation of real- and\npredicted-time intervals between assembly stages. 
The results show high\nprecision in tracking the validity of manual assembly and do not depend on the\nduration of the assembly process.\n","authors":["Anton Sergeev","Victor Minchenkov","Aleksei Soldatov"],"pdf_url":"https://arxiv.org/pdf/2401.10777v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.01984v2","updated":"2024-01-19T15:51:32Z","published":"2024-01-03T21:24:44Z","title":"AUPIMO: Redefining Visual Anomaly Detection Benchmarks with High Speed\n and Low Tolerance","summary":" Recent advances in visual anomaly detection research have seen AUROC and\nAUPRO scores on public benchmark datasets such as MVTec and VisA converge\ntowards perfect recall, giving the impression that these benchmarks are\nnear-solved. However, high AUROC and AUPRO scores do not always reflect\nqualitative performance, which limits the validity of these metrics in\nreal-world applications. We argue that the artificial ceiling imposed by the\nlack of an adequate evaluation metric restrains progression of the field, and\nit is crucial that we revisit the evaluation metrics used to rate our\nalgorithms. In response, we introduce Per-IMage Overlap (PIMO), a novel metric\nthat addresses the shortcomings of AUROC and AUPRO. PIMO retains the\nrecall-based nature of the existing metrics but introduces two distinctions:\nthe assignment of curves (and respective area under the curve) is per-image,\nand its X-axis relies solely on normal images. Measuring recall per image\nsimplifies instance score indexing and is more robust to noisy annotations. As\nwe show, it also accelerates computation and enables the usage of statistical\ntests to compare models. By imposing low tolerance for false positives on\nnormal images, PIMO provides an enhanced model validation procedure and\nhighlights performance variations across datasets. Our experiments demonstrate\nthat PIMO offers practical advantages and nuanced performance insights that\nredefine anomaly detection benchmarks -- notably challenging the perception\nthat MVTec AD and VisA datasets have been solved by contemporary models.\nAvailable on GitHub: https://github.com/jpcbertoldo/aupimo.\n","authors":["Joao P. C. Bertoldo","Dick Ameln","Ashwin Vaidya","Samet Akçay"],"pdf_url":"https://arxiv.org/pdf/2401.01984v2.pdf","comment":"This research has been conducted during Google Summer of Code 2023\n (GSoC 2023) at OpenVINO (Intel). GSoC 2023 page:\n https://summerofcode.withgoogle.com/archive/2023/projects/SPMopugd"},{"id":"http://arxiv.org/abs/2401.10761v1","updated":"2024-01-19T15:33:46Z","published":"2024-01-19T15:33:46Z","title":"NN-VVC: Versatile Video Coding boosted by self-supervisedly learned\n image coding for machines","summary":" The recent progress in artificial intelligence has led to an ever-increasing\nusage of images and videos by machine analysis algorithms, mainly neural\nnetworks. Nonetheless, compression, storage and transmission of media have\ntraditionally been designed considering human beings as the viewers of the\ncontent. Recent research on image and video coding for machine analysis has\nprogressed mainly in two almost orthogonal directions. The first is represented\nby end-to-end (E2E) learned codecs which, while offering high performance on\nimage coding, are not yet on par with state-of-the-art conventional video\ncodecs and lack interoperability. 
The second direction considers using the\nVersatile Video Coding (VVC) standard or any other conventional video codec\n(CVC) together with pre- and post-processing operations targeting machine\nanalysis. While the CVC-based methods benefit from interoperability and broad\nhardware and software support, the machine task performance is often lower than\nthe desired level, particularly in low bitrates. This paper proposes a hybrid\ncodec for machines called NN-VVC, which combines the advantages of an\nE2E-learned image codec and a CVC to achieve high performance in both image and\nvideo coding for machines. Our experiments show that the proposed system\nachieved up to -43.20% and -26.8% Bj{\\o}ntegaard Delta rate reduction over VVC\nfor image and video data, respectively, when evaluated on multiple different\ndatasets and machine vision tasks. To the best of our knowledge, this is the\nfirst research paper showing a hybrid video codec that outperforms VVC on\nmultiple datasets and multiple machine vision tasks.\n","authors":["Jukka I. Ahonen","Nam Le","Honglei Zhang","Antti Hallapuro","Francesco Cricri","Hamed Rezazadegan Tavakoli","Miska M. Hannuksela","Esa Rahtu"],"pdf_url":"https://arxiv.org/pdf/2401.10761v1.pdf","comment":"ISM 2023 Best paper award winner version"},{"id":"http://arxiv.org/abs/2212.08044v3","updated":"2024-01-19T15:29:34Z","published":"2022-12-15T18:52:03Z","title":"Benchmarking Robustness of Multimodal Image-Text Models under\n Distribution Shift","summary":" Multimodal image-text models have shown remarkable performance in the past\nfew years. However, evaluating robustness against distribution shifts is\ncrucial before adopting them in real-world applications. In this work, we\ninvestigate the robustness of 12 popular open-sourced image-text models under\ncommon perturbations on five tasks (image-text retrieval, visual reasoning,\nvisual entailment, image captioning, and text-to-image generation). In\nparticular, we propose several new multimodal robustness benchmarks by applying\n17 image perturbation and 16 text perturbation techniques on top of existing\ndatasets. We observe that multimodal models are not robust to image and text\nperturbations, especially to image perturbations. Among the tested perturbation\nmethods, character-level perturbations constitute the most severe distribution\nshift for text, and zoom blur is the most severe shift for image data. We also\nintroduce two new robustness metrics (\\textbf{MMI} for MultiModal Impact score\nand \\textbf{MOR} for Missing Object Rate) for proper evaluations of multimodal\nmodels. We hope our extensive study sheds light on new directions for the\ndevelopment of robust multimodal models. More details can be found on the\nproject webpage: \\url{https://MMRobustness.github.io}.\n","authors":["Jielin Qiu","Yi Zhu","Xingjian Shi","Florian Wenzel","Zhiqiang Tang","Ding Zhao","Bo Li","Mu Li"],"pdf_url":"https://arxiv.org/pdf/2212.08044v3.pdf","comment":"Accepted by Journal of Data-centric Machine Learning Research (DMLR)\n 2024"},{"id":"http://arxiv.org/abs/2401.10752v1","updated":"2024-01-19T15:21:51Z","published":"2024-01-19T15:21:51Z","title":"HiCD: Change Detection in Quality-Varied Images via Hierarchical\n Correlation Distillation","summary":" Advanced change detection techniques primarily target image pairs of equal\nand high quality. However, variations in imaging conditions and platforms\nfrequently lead to image pairs with distinct qualities: one image being\nhigh-quality, while the other being low-quality. 
These disparities in image\nquality present significant challenges for understanding image pairs\nsemantically and extracting change features, ultimately resulting in a notable\ndecline in performance. To tackle this challenge, we introduce an innovative\ntraining strategy grounded in knowledge distillation. The core idea revolves\naround leveraging task knowledge acquired from high-quality image pairs to\nguide the model's learning process when dealing with image pairs that exhibit\ndifferences in quality. Additionally, we develop a hierarchical correlation\ndistillation approach (involving self-correlation, cross-correlation, and\nglobal correlation). This approach compels the student model to replicate the\ncorrelations inherent in the teacher model, rather than focusing solely on\nindividual features. This ensures effective knowledge transfer while\nmaintaining the student model's training flexibility.\n","authors":["Chao Pang","Xingxing Weng","Jiang Wu","Qiang Wang","Gui-Song Xia"],"pdf_url":"https://arxiv.org/pdf/2401.10752v1.pdf","comment":"accepted by TGRS"},{"id":"http://arxiv.org/abs/2401.10741v1","updated":"2024-01-19T14:59:26Z","published":"2024-01-19T14:59:26Z","title":"Character Recognition in Byzantine Seals with Deep Neural Networks","summary":" Seals are small coin-shaped artifacts, mostly made of lead, held with strings\nto seal letters. This work presents the first attempt towards automatic reading\nof text on Byzantine seal images. Byzantine seals are generally decorated with\niconography on the obverse side and Greek text on the reverse side. Text may\ninclude the sender's name, position in the Byzantine aristocracy, and elements\nof prayers. Both text and iconography are precious literary sources that wait\nto be exploited electronically, so the development of computerized systems for\ninterpreting seal images is of paramount importance. This work's contribution\nis hence a deep, two-stage character reading pipeline for transcribing\nByzantine seal images. A first deep convolutional neural network (CNN) detects\ncharacters in the seal (character localization). A second convolutional network\nreads the localized characters (character classification). Finally, a\ndiplomatic transcription of the seal is provided by post-processing the two\nnetwork outputs. We provide an experimental evaluation of each CNN in isolation\nand both CNNs in combination. All performances are evaluated by\ncross-validation. Character localization achieves a mean average precision\n(mAP@0.5) greater than 0.9. Classification of characters cropped from ground\ntruth bounding boxes achieves Top-1 accuracy greater than 0.92. End-to-end\nevaluation shows the efficiency of the proposed approach when compared to the\nSoTA for similar tasks.\n","authors":["Théophile Rageau","Laurence Likforman-Sulem","Attilio Fiandrotti","Victoria Eyharabide","Béatrice Caseau","Jean-Claude Cheynet"],"pdf_url":"https://arxiv.org/pdf/2401.10741v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10732v1","updated":"2024-01-19T14:49:56Z","published":"2024-01-19T14:49:56Z","title":"Bridging the gap between image coding for machines and humans","summary":" Image coding for machines (ICM) aims at reducing the bitrate required to\nrepresent an image while minimizing the drop in machine vision analysis\naccuracy. 
In many use cases, such as surveillance, it is also important that\nthe visual quality is not drastically deteriorated by the compression process.\nRecent works on using neural network (NN) based ICM codecs have shown\nsignificant coding gains against traditional methods; however, the decompressed\nimages, especially at low bitrates, often contain checkerboard artifacts. We\npropose an effective decoder finetuning scheme based on adversarial training to\nsignificantly enhance the visual quality of ICM codecs, while preserving the\nmachine analysis accuracy, without adding extra bitcost or parameters at the\ninference phase. The results show complete removal of the checkerboard\nartifacts at the negligible cost of -1.6% relative change in task performance\nscore. In the cases where some amount of artifacts is tolerable, such as when\nmachine consumption is the primary target, this technique can enhance both\npixel-fidelity and feature-fidelity scores without losing task performance.\n","authors":["Nam Le","Honglei Zhang","Francesco Cricri","Ramin G. Youvalari","Hamed Rezazadegan Tavakoli","Emre Aksu","Miska M. Hannuksela","Esa Rahtu"],"pdf_url":"https://arxiv.org/pdf/2401.10732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10731v1","updated":"2024-01-19T14:49:42Z","published":"2024-01-19T14:49:42Z","title":"Removal and Selection: Improving RGB-Infrared Object Detection via\n Coarse-to-Fine Fusion","summary":" Object detection in visible (RGB) and infrared (IR) images has been widely\napplied in recent years. Leveraging the complementary characteristics of RGB\nand IR images, the object detector provides reliable and robust object\nlocalization from day to night. Existing fusion strategies directly inject RGB\nand IR images into convolution neural networks, leading to inferior detection\nperformance. Since the RGB and IR features have modality-specific noise, these\nstrategies will worsen the fused features along with the propagation. Inspired\nby the mechanism of human brain processing multimodal information, this work\nintroduces a new coarse-to-fine perspective to purify and fuse two modality\nfeatures. Specifically, following this perspective, we design a Redundant\nSpectrum Removal module to coarsely remove interfering information within each\nmodality and a Dynamic Feature Selection module to finely select the desired\nfeatures for feature fusion. To verify the effectiveness of the coarse-to-fine\nfusion strategy, we construct a new object detector called Removal and\nSelection Detector (RSDet). Extensive experiments on three RGB-IR object\ndetection datasets verify the superior performance of our method.\n","authors":["Tianyi Zhao","Maoxun Yuan","Xingxing Wei"],"pdf_url":"https://arxiv.org/pdf/2401.10731v1.pdf","comment":"9pages, 7figures"},{"id":"http://arxiv.org/abs/2401.10727v1","updated":"2024-01-19T14:44:37Z","published":"2024-01-19T14:44:37Z","title":"Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning","summary":" Recently, the astonishing performance of large language models (LLMs) in\nnatural language comprehension and generation tasks triggered lots of\nexploration of using them as central controllers to build agent systems.\nMultiple studies focus on bridging the LLMs to external tools to extend the\napplication scenarios. However, the current LLMs' perceiving tool-use ability\nis limited to a single text query, which may result in ambiguity in\nunderstanding the users' real intentions. 
LLMs are expected to eliminate that\nby perceiving the visual- or auditory-grounded instructions' information.\nTherefore, in this paper, we propose Tool-LMM, a system incorporating\nopen-source LLMs and multi-modal encoders so that the learnt LLMs can be\nconscious of multi-modal input instruction and then select the function-matched\ntool correctly. To facilitate the evaluation of the model's capability, we\ncollect a dataset featured by consisting of multi-modal input tools from\nHuggingFace. Another important feature of our dataset is that our dataset also\ncontains multiple potential choices for the same instruction due to the\nexistence of identical functions and synonymous functions, which provides more\npotential solutions for the same query. The experiments reveal that our LMM is\ncapable of recommending appropriate tools for multi-modal instructions. Codes\nand data are available at https://github.com/Tool-LMM/Tool-LMM.\n","authors":["Chenyu Wang","Weixin Luo","Qianyu Chen","Haonan Mai","Jindi Guo","Sixun Dong"," Xiaohua"," Xuan","Zhengxin Li","Lin Ma","Shenghua Gao"],"pdf_url":"https://arxiv.org/pdf/2401.10727v1.pdf","comment":"21 pages, 9 figures, 10 tables"},{"id":"http://arxiv.org/abs/2103.10702v4","updated":"2024-01-19T14:43:57Z","published":"2021-03-19T09:31:08Z","title":"ClawCraneNet: Leveraging Object-level Relation for Text-based Video\n Segmentation","summary":" Text-based video segmentation is a challenging task that segments out the\nnatural language referred objects in videos. It essentially requires semantic\ncomprehension and fine-grained video understanding. Existing methods introduce\nlanguage representation into segmentation models in a bottom-up manner, which\nmerely conducts vision-language interaction within local receptive fields of\nConvNets. We argue that such interaction is not fulfilled since the model can\nbarely construct region-level relationships given partial observations, which\nis contrary to the description logic of natural language/referring expressions.\nIn fact, people usually describe a target object using relations with other\nobjects, which may not be easily understood without seeing the whole video. To\naddress the issue, we introduce a novel top-down approach by imitating how we\nhuman segment an object with the language guidance. We first figure out all\ncandidate objects in videos and then choose the refereed one by parsing\nrelations among those high-level objects. Three kinds of object-level relations\nare investigated for precise relationship understanding, i.e., positional\nrelation, text-guided semantic relation, and temporal relation. Extensive\nexperiments on A2D Sentences and J-HMDB Sentences show our method outperforms\nstate-of-the-art methods by a large margin. 
Qualitative results also show our\nresults are more explainable.\n","authors":["Chen Liang","Yu Wu","Yawei Luo","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2103.10702v4.pdf","comment":"Extended version published in\n https://ieeexplore.ieee.org/abstract/document/10083244"},{"id":"http://arxiv.org/abs/2401.10712v1","updated":"2024-01-19T14:22:29Z","published":"2024-01-19T14:22:29Z","title":"Q&A Prompts: Discovering Rich Visual Clues through Mining\n Question-Answer Prompts for VQA requiring Diverse World Knowledge","summary":" With the breakthrough of multi-modal large language models, answering complex\nvisual questions that demand advanced reasoning abilities and world knowledge\nhas become a much more important testbed for developing AI models than ever.\nHowever, equipping AI models with robust cross-modality reasoning ability\nremains challenging since the cognition scheme of humans has not been\nunderstood systematically. In this paper, we believe that if we can collect\nvisual clues in the given image as much as possible, we will recognize the\nimage more accurately, understand the question better, recall relevant\nknowledge more easily, and finally reason out the answer. We discover these\nrich visual clues by mining question-answer pairs in images and sending them\ninto multi-modal large language models as prompts. We call the proposed method\nQ&A Prompts. Specifically, we first use the image-answer pairs and the\ncorresponding questions in the training set as inputs and outputs to train a\nvisual question generation model. Then, we use an image tagging model to\nidentify various instances and send packaged image-tag pairs into the visual\nquestion generation model to generate relevant questions with the extracted\nimage tags as answers. Finally, we encode these generated question-answer pairs\nas prompts with a visual-aware prompting module and send them into pre-trained\nmulti-modal large language models to reason out the final answers. Experimental\nresults show that, compared with state-of-the-art methods, our Q&A Prompts\nachieves substantial improvements on the challenging visual question answering\ndatasets requiring reasoning over diverse world knowledge, such as OK-VQA and\nA-OKVQA.\n","authors":["Haibi Wang","Weifeng Ge"],"pdf_url":"https://arxiv.org/pdf/2401.10712v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10711v1","updated":"2024-01-19T14:21:46Z","published":"2024-01-19T14:21:46Z","title":"Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal\n Models for Video Question Answering","summary":" Video Question Answering (VideoQA) aims to answer natural language questions\nbased on the information observed in videos. Despite the recent success of\nLarge Multimodal Models (LMMs) in image-language understanding and reasoning,\nthey deal with VideoQA insufficiently by simply taking uniformly sampled frames\nas visual inputs, which ignores question-relevant visual clues. Moreover, there\nare no human annotations for question-critical timestamps in existing VideoQA\ndatasets. In light of this, we propose a novel weakly supervised framework to\nenforce the LMMs to reason out the answers with question-critical moments as\nvisual inputs. Specifically, we fuse the question and answer pairs as event\ndescriptions to find multiple keyframes as target moments, which will be\npseudo-labels. With these pseudo-labels as additionally weak supervision, we\ndevise a lightweight Gaussian-based Contrastive Grounding (GCG) module. 
GCG\nlearns multiple Gaussian functions to characterize the temporal structure of\nthe video, and sample question-critical frames as positive moments to be the\nvisual inputs of LMMs. Extensive experiments on several VideoQA benchmarks\nverify the effectiveness of our framework, and we achieve substantial\nimprovements compared to previous state-of-the-art methods.\n","authors":["Haibo Wang","Chenghang Lai","Yixuan Sun","Weifeng Ge"],"pdf_url":"https://arxiv.org/pdf/2401.10711v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10709v1","updated":"2024-01-19T14:14:26Z","published":"2024-01-19T14:14:26Z","title":"Dense 3D Reconstruction Through Lidar: A Comparative Study on Ex-vivo\n Porcine Tissue","summary":" New sensing technologies and more advanced processing algorithms are\ntransforming computer-integrated surgery. While researchers are actively\ninvestigating depth sensing and 3D reconstruction for vision-based surgical\nassistance, it remains difficult to achieve real-time, accurate, and robust 3D\nrepresentations of the abdominal cavity for minimally invasive surgery. Thus,\nthis work uses quantitative testing on fresh ex-vivo porcine tissue to\nthoroughly characterize the quality with which a 3D laser-based time-of-flight\nsensor (lidar) can perform anatomical surface reconstruction. Ground-truth\nsurface shapes are captured with a commercial laser scanner, and the resulting\nsigned error fields are analyzed using rigorous statistical tools. When\ncompared to modern learning-based stereo matching from endoscopic images,\ntime-of-flight sensing demonstrates higher precision, lower processing delay,\nhigher frame rate, and superior robustness against sensor distance and poor\nillumination. Furthermore, we report on the potential negative effect of\nnear-infrared light penetration on the accuracy of lidar measurements across\ndifferent tissue samples, identifying a significant measured depth offset for\nmuscle in contrast to fat and liver. Our findings highlight the potential of\nlidar for intraoperative 3D perception and point toward new methods that\ncombine complementary time-of-flight and spectral imaging.\n","authors":["Guido Caccianiga","Julian Nubert","Marco Hutter","Katherine J. Kuchenbecker"],"pdf_url":"https://arxiv.org/pdf/2401.10709v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.11795v2","updated":"2024-01-19T14:08:38Z","published":"2022-10-21T08:18:49Z","title":"PoseScript: Linking 3D Human Poses and Natural Language","summary":" Natural language plays a critical role in many computer vision applications,\nsuch as image captioning, visual question answering, and cross-modal retrieval,\nto provide fine-grained semantic information. Unfortunately, while human pose\nis key to human understanding, current 3D human pose datasets lack detailed\nlanguage descriptions. To address this issue, we have introduced the PoseScript\ndataset. This dataset pairs more than six thousand 3D human poses from AMASS\nwith rich human-annotated descriptions of the body parts and their spatial\nrelationships. Additionally, to increase the size of the dataset to a scale\nthat is compatible with data-hungry learning algorithms, we have proposed an\nelaborate captioning process that generates automatic synthetic descriptions in\nnatural language from given 3D keypoints. This process extracts low-level pose\ninformation, known as \"posecodes\", using a set of simple but generic rules on\nthe 3D keypoints. 
These posecodes are then combined into higher level textual\ndescriptions using syntactic rules. With automatic annotations, the amount of\navailable data significantly scales up (100k), making it possible to\neffectively pretrain deep models for finetuning on human captions. To showcase\nthe potential of annotated poses, we present three multi-modal learning tasks\nthat utilize the PoseScript dataset. Firstly, we develop a pipeline that maps\n3D poses and textual descriptions into a joint embedding space, allowing for\ncross-modal retrieval of relevant poses from large-scale datasets. Secondly, we\nestablish a baseline for a text-conditioned model generating 3D poses. Thirdly,\nwe present a learned process for generating pose descriptions. These\napplications demonstrate the versatility and usefulness of annotated poses in\nvarious tasks and pave the way for future research in the field.\n","authors":["Ginger Delmas","Philippe Weinzaepfel","Thomas Lucas","Francesc Moreno-Noguer","Grégory Rogez"],"pdf_url":"https://arxiv.org/pdf/2210.11795v2.pdf","comment":"Extended version of the ECCV 2022 paper"},{"id":"http://arxiv.org/abs/2106.01061v2","updated":"2024-01-19T13:44:46Z","published":"2021-06-02T10:26:13Z","title":"Rethinking Cross-modal Interaction from a Top-down Perspective for\n Referring Video Object Segmentation","summary":" Referring video object segmentation (RVOS) aims to segment video objects with\nthe guidance of natural language reference. Previous methods typically tackle\nRVOS through directly grounding linguistic reference over the image lattice.\nSuch bottom-up strategy fails to explore object-level cues, easily leading to\ninferior results. In this work, we instead put forward a two-stage, top-down\nRVOS solution. First, an exhaustive set of object tracklets is constructed by\npropagating object masks detected from several sampled frames to the entire\nvideo. Second, a Transformer-based tracklet-language grounding module is\nproposed, which models instance-level visual relations and cross-modal\ninteractions simultaneously and efficiently. Our model ranks first place on\nCVPR2021 Referring Youtube-VOS challenge.\n","authors":["Chen Liang","Yu Wu","Tianfei Zhou","Wenguan Wang","Zongxin Yang","Yunchao Wei","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2106.01061v2.pdf","comment":"Champion solution in YouTube-VOS 2021 Track 3. Extended version\n published in https://ieeexplore.ieee.org/abstract/document/10083244"},{"id":"http://arxiv.org/abs/2301.13359v4","updated":"2024-01-19T13:25:03Z","published":"2023-01-31T01:24:45Z","title":"IM-IAD: Industrial Image Anomaly Detection Benchmark in Manufacturing","summary":" Image anomaly detection (IAD) is an emerging and vital computer vision task\nin industrial manufacturing (IM). Recently, many advanced algorithms have been\nreported, but their performance deviates considerably with various IM settings.\nWe realize that the lack of a uniform IM benchmark is hindering the development\nand usage of IAD methods in real-world applications. In addition, it is\ndifficult for researchers to analyze IAD algorithms without a uniform\nbenchmark. To solve this problem, we propose a uniform IM benchmark, for the\nfirst time, to assess how well these algorithms perform, which includes various\nlevels of supervision (unsupervised versus fully supervised), learning\nparadigms (few-shot, continual and noisy label), and efficiency (memory usage\nand inference speed). 
Then, we construct a comprehensive image anomaly\ndetection benchmark (IM-IAD), which includes 19 algorithms on seven major\ndatasets with a uniform setting. Extensive experiments (17,017 total) on IM-IAD\nprovide in-depth insights into IAD algorithm redesign or selection. Moreover,\nthe proposed IM-IAD benchmark challenges existing algorithms and suggests\nfuture research directions. To foster reproducibility and accessibility, the\nsource code of IM-IAD is uploaded on the website,\nhttps://github.com/M-3LAB/IM-IAD.\n","authors":["Guoyang Xie","Jinbao Wang","Jiaqi Liu","Jiayi Lyu","Yong Liu","Chengjie Wang","Feng Zheng","Yaochu Jin"],"pdf_url":"https://arxiv.org/pdf/2301.13359v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13310v2","updated":"2024-01-19T13:03:04Z","published":"2023-05-22T17:59:43Z","title":"Matcher: Segment Anything with One Shot Using All-Purpose Feature\n Matching","summary":" Powered by large-scale pre-training, vision foundation models exhibit\nsignificant potential in open-world image understanding. However, unlike large\nlanguage models that excel at directly tackling various language tasks, vision\nfoundation models require a task-specific model structure followed by\nfine-tuning on specific tasks. In this work, we present Matcher, a novel\nperception paradigm that utilizes off-the-shelf vision foundation models to\naddress various perception tasks. Matcher can segment anything by using an\nin-context example without training. Additionally, we design three effective\ncomponents within the Matcher framework to collaborate with these foundation\nmodels and unleash their full potential in diverse perception tasks. Matcher\ndemonstrates impressive generalization performance across various segmentation\ntasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$\nwith one example, surpassing the state-of-the-art specialist model by 1.6%. In\naddition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot\nsemantic segmentation, outperforming the state-of-the-art generalist model by\n14.4%. Our visualization results further showcase the open-world generality and\nflexibility of Matcher when applied to images in the wild. Our code can be\nfound at https://github.com/aim-uofa/Matcher.\n","authors":["Yang Liu","Muzhi Zhu","Hengtao Li","Hao Chen","Xinlong Wang","Chunhua Shen"],"pdf_url":"https://arxiv.org/pdf/2305.13310v2.pdf","comment":"Accepted to ICLR2024"},{"id":"http://arxiv.org/abs/2203.09773v2","updated":"2024-01-19T13:01:44Z","published":"2022-03-18T07:35:26Z","title":"Local-Global Context Aware Transformer for Language-Guided Video\n Segmentation","summary":" We explore the task of language-guided video segmentation (LVS). Previous\nalgorithms mostly adopt 3D CNNs to learn video representation, struggling to\ncapture long-term context and easily suffering from visual-linguistic\nmisalignment. In light of this, we present Locater (local-global context aware\nTransformer), which augments the Transformer architecture with a finite memory\nso as to query the entire video with the language expression in an efficient\nmanner. The memory is designed to involve two components -- one for\npersistently preserving global video content, and one for dynamically gathering\nlocal temporal context and segmentation history. Based on the memorized\nlocal-global context and the particular content of each frame, Locater\nholistically and flexibly comprehends the expression as an adaptive query\nvector for each frame. 
The vector is used to query the corresponding frame for\nmask generation. The memory also allows Locater to process videos with linear\ntime complexity and constant size memory, while Transformer-style\nself-attention computation scales quadratically with sequence length. To\nthoroughly examine the visual grounding capability of LVS models, we contribute\na new LVS dataset, A2D-S+, which is built upon A2D-S dataset but poses\nincreased challenges in disambiguating among similar objects. Experiments on\nthree LVS datasets and our A2D-S+ show that Locater outperforms previous\nstate-of-the-arts. Further, we won the 1st place in the Referring Video Object\nSegmentation Track of the 3rd Large-scale Video Object Segmentation Challenge,\nwhere Locater served as the foundation for the winning solution. Our code and\ndataset are available at: https://github.com/leonnnop/Locater\n","authors":["Chen Liang","Wenguan Wang","Tianfei Zhou","Jiaxu Miao","Yawei Luo","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2203.09773v2.pdf","comment":"Accepted by TPAMI. Code, data: https://github.com/leonnnop/Locater"},{"id":"http://arxiv.org/abs/2401.10666v1","updated":"2024-01-19T12:40:54Z","published":"2024-01-19T12:40:54Z","title":"MixNet: Towards Effective and Efficient UHD Low-Light Image Enhancement","summary":" With the continuous advancement of imaging devices, the prevalence of\nUltra-High-Definition (UHD) images is rising. Although many image restoration\nmethods have achieved promising results, they are not directly applicable to\nUHD images on devices with limited computational resources due to the\ninherently high computational complexity of UHD images. In this paper, we focus\non the task of low-light image enhancement (LLIE) and propose a novel LLIE\nmethod called MixNet, which is designed explicitly for UHD images. To capture\nthe long-range dependency of features without introducing excessive\ncomputational complexity, we present the Global Feature Modulation Layer\n(GFML). GFML associates features from different views by permuting the feature\nmaps, enabling efficient modeling of long-range dependency. In addition, we\nalso design the Local Feature Modulation Layer (LFML) and Feed-forward Layer\n(FFL) to capture local features and transform features into a compact\nrepresentation. This way, our MixNet achieves effective LLIE with few model\nparameters and low computational complexity. We conducted extensive experiments\non both synthetic and real-world datasets, and the comprehensive results\ndemonstrate that our proposed method surpasses the performance of current\nstate-of-the-art methods. The code will be available at\n\\url{https://github.com/zzr-idam/MixNet}.\n","authors":["Chen Wu","Zhuoran Zheng","Xiuyi Jia","Wenqi Ren"],"pdf_url":"https://arxiv.org/pdf/2401.10666v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.15420v2","updated":"2024-01-19T12:34:42Z","published":"2023-11-26T21:04:28Z","title":"Data-Driven Modelling for Harmonic Current Emission in Low-Voltage Grid\n Using MCReSANet with Interpretability Analysis","summary":" Even though the use of power electronics PE loads offers enhanced electrical\nenergy conversion efficiency and control, they remain the primary sources of\nharmonics in grids. When diverse loads are connected in the distribution\nsystem, their interactions complicate establishing analytical models for the\nrelationship between harmonic voltages and currents. 
To solve this, our paper\npresents a data-driven model using MCReSANet to construct the highly nonlinear\nmapping between harmonic voltage and current. Two datasets from PCCs in Finland and\nGermany are utilized, which demonstrates that MCReSANet is capable of\nestablishing accurate nonlinear mappings, even in the presence of various\nnetwork characteristics for selected Finland and Germany datasets. The model\nbuilt by MCReSANet can improve the MAE by 10% and 14% compared to the CNN, and\nby 8% and 17% compared to the MLP for both Finnish and German datasets, also\nshowing much lower model uncertainty than others. This is a crucial\nprerequisite for more precise SHAP value-based feature importance analysis,\nwhich is a method for the model interpretability analysis in this paper. The\nresults by feature importance analysis show the detailed relationships between\neach order of harmonic voltage and current in the distribution system. There is\nan interactive impact on each order of harmonic current, but some orders of\nharmonic voltages have a dominant influence on harmonic current emissions:\npositive sequence and zero sequence harmonics have the dominant importance in\nthe Finnish and German networks, respectively, which conforms to the pattern of\nconnected load types in two selected Finnish and German datasets. This paper\nenhances the potential for understanding and predicting harmonic current\nemissions by diverse PE loads in distribution systems, which is beneficial to\nmore effective management for optimizing power quality in diverse grid\nenvironments.\n","authors":["Jieyu Yao","Hao Yu","Paul Judge","Jiabin Jia","Sasa Djokic","Verner Püvi","Matti Lehtonen","Jan Meyer"],"pdf_url":"https://arxiv.org/pdf/2311.15420v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.16516v2","updated":"2024-01-19T12:29:47Z","published":"2023-12-27T10:49:19Z","title":"ConstScene: Dataset and Model for Advancing Robust Semantic Segmentation\n in Construction Environments","summary":" The increasing demand for autonomous machines in construction environments\nnecessitates the development of robust object detection algorithms that can\nperform effectively across various weather and environmental conditions. This\npaper introduces a new semantic segmentation dataset specifically tailored for\nconstruction sites, taking into account the diverse challenges posed by adverse\nweather and environmental conditions. The dataset is designed to enhance the\ntraining and evaluation of object detection models, fostering their\nadaptability and reliability in real-world construction applications. Our\ndataset comprises annotated images captured under a wide range of different\nweather conditions, including but not limited to sunny days, rainy periods,\nfoggy atmospheres, and low-light situations. Additionally, environmental\nfactors such as the existence of dirt/mud on the camera lens are integrated\ninto the dataset through actual captures and synthetic generation to simulate\nthe complex conditions prevalent in construction sites. We also generate\nsynthetic images of the annotations including precise semantic segmentation\nmasks for various objects commonly found in construction environments, such as\nwheel loader machines, personnel, cars, and structural elements. To demonstrate\nthe dataset's utility, we evaluate state-of-the-art object detection algorithms\non our proposed benchmark. 
The results highlight the dataset's success in\nadversarial training models across diverse conditions, showcasing its efficacy\ncompared to existing datasets that lack such environmental variability.\n","authors":["Maghsood Salimi","Mohammad Loni","Sara Afshar","Antonio Cicchetti","Marjan Sirjani"],"pdf_url":"https://arxiv.org/pdf/2312.16516v2.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2401.10659v1","updated":"2024-01-19T12:26:51Z","published":"2024-01-19T12:26:51Z","title":"BadODD: Bangladeshi Autonomous Driving Object Detection Dataset","summary":" We propose a comprehensive dataset for object detection in diverse driving\nenvironments across 9 districts in Bangladesh. The dataset, collected\nexclusively from smartphone cameras, provided a realistic representation of\nreal-world scenarios, including day and night conditions. Most existing\ndatasets lack suitable classes for autonomous navigation on Bangladeshi roads,\nmaking it challenging for researchers to develop models that can handle the\nintricacies of road scenarios. To address this issue, the authors proposed a\nnew set of classes based on characteristics rather than local vehicle names.\nThe dataset aims to encourage the development of models that can handle the\nunique challenges of Bangladeshi road scenarios for the effective deployment of\nautonomous vehicles. The dataset did not consist of any online images to\nsimulate real-world conditions faced by autonomous vehicles. The classification\nof vehicles is challenging because of the diverse range of vehicles on\nBangladeshi roads, including those not found elsewhere in the world. The\nproposed classification system is scalable and can accommodate future vehicles,\nmaking it a valuable resource for researchers in the autonomous vehicle sector.\n","authors":["Mirza Nihal Baig","Rony Hajong","Mahdi Murshed Patwary","Mohammad Shahidur Rahman","Husne Ara Chowdhury"],"pdf_url":"https://arxiv.org/pdf/2401.10659v1.pdf","comment":"7 pages"},{"id":"http://arxiv.org/abs/2312.08010v2","updated":"2024-01-19T12:19:48Z","published":"2023-12-13T09:33:08Z","title":"EZ-CLIP: Efficient Zeroshot Video Action Recognition","summary":" Recent advancements in large-scale pre-training of visual-language models on\npaired image-text data have demonstrated impressive generalization capabilities\nfor zero-shot tasks. Building on this success, efforts have been made to adapt\nthese image-based visual-language models, such as CLIP, for videos extending\ntheir zero-shot capabilities to the video domain. While these adaptations have\nshown promising results, they come at a significant computational cost and\nstruggle with effectively modeling the crucial temporal aspects inherent to the\nvideo domain. In this study, we present EZ-CLIP, a simple and efficient\nadaptation of CLIP that addresses these challenges. EZ-CLIP leverages temporal\nvisual prompting for seamless temporal adaptation, requiring no fundamental\nalterations to the core CLIP architecture while preserving its remarkable\ngeneralization abilities. Moreover, we introduce a novel learning objective\nthat guides the temporal visual prompts to focus on capturing motion, thereby\nenhancing its learning capabilities from video data. 
We conducted extensive\nexperiments on five different benchmark datasets, thoroughly evaluating EZ-CLIP\nfor zero-shot learning and base-to-novel video action recognition, and also\ndemonstrating its potential for few-shot generalization.Impressively, with a\nmere 5.2 million learnable parameters (as opposed to the 71.1 million in the\nprior best model), EZ-CLIP can be efficiently trained on a single GPU,\noutperforming existing approaches in several evaluations.\n","authors":["Shahzad Ahmad","Sukalpa Chanda","Yogesh S Rawat"],"pdf_url":"https://arxiv.org/pdf/2312.08010v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07823v4","updated":"2024-01-19T12:18:28Z","published":"2023-12-13T01:16:50Z","title":"Semantic Lens: Instance-Centric Semantic Alignment for Video\n Super-Resolution","summary":" As a critical clue of video super-resolution (VSR), inter-frame alignment\nsignificantly impacts overall performance. However, accurate pixel-level\nalignment is a challenging task due to the intricate motion interweaving in the\nvideo. In response to this issue, we introduce a novel paradigm for VSR named\nSemantic Lens, predicated on semantic priors drawn from degraded videos.\nSpecifically, video is modeled as instances, events, and scenes via a Semantic\nExtractor. Those semantics assist the Pixel Enhancer in understanding the\nrecovered contents and generating more realistic visual results. The distilled\nglobal semantics embody the scene information of each frame, while the\ninstance-specific semantics assemble the spatial-temporal contexts related to\neach instance. Furthermore, we devise a Semantics-Powered Attention\nCross-Embedding (SPACE) block to bridge the pixel-level features with semantic\nknowledge, composed of a Global Perspective Shifter (GPS) and an\nInstance-Specific Semantic Embedding Encoder (ISEE). Concretely, the GPS module\ngenerates pairs of affine transformation parameters for pixel-level feature\nmodulation conditioned on global semantics. After that, the ISEE module\nharnesses the attention mechanism to align the adjacent frames in the\ninstance-centric semantic space. In addition, we incorporate a simple yet\neffective pre-alignment module to alleviate the difficulty of model training.\nExtensive experiments demonstrate the superiority of our model over existing\nstate-of-the-art VSR methods.\n","authors":["Qi Tang","Yao Zhao","Meiqin Liu","Jian Jin","Chao Yao"],"pdf_url":"https://arxiv.org/pdf/2312.07823v4.pdf","comment":"Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2401.10643v1","updated":"2024-01-19T11:45:10Z","published":"2024-01-19T11:45:10Z","title":"A Comprehensive Survey on Deep-Learning-based Vehicle Re-Identification:\n Models, Data Sets and Challenges","summary":" Vehicle re-identification (ReID) endeavors to associate vehicle images\ncollected from a distributed network of cameras spanning diverse traffic\nenvironments. This task assumes paramount importance within the spectrum of\nvehicle-centric technologies, playing a pivotal role in deploying Intelligent\nTransportation Systems (ITS) and advancing smart city initiatives. Rapid\nadvancements in deep learning have significantly propelled the evolution of\nvehicle ReID technologies in recent years. Consequently, undertaking a\ncomprehensive survey of methodologies centered on deep learning for vehicle\nre-identification has become imperative and inescapable. This paper extensively\nexplores deep learning techniques applied to vehicle ReID. 
It outlines the\ncategorization of these methods, encompassing supervised and unsupervised\napproaches, delves into existing research within these categories, introduces\ndatasets and evaluation criteria, and delineates forthcoming challenges and\npotential research directions. This comprehensive assessment examines the\nlandscape of deep learning in vehicle ReID and establishes a foundation and\nstarting point for future work. It aims to serve as a complete reference by\nhighlighting challenges and emerging trends, fostering advancements and\napplications in vehicle ReID utilizing deep learning models.\n","authors":["Ali Amiri","Aydin Kaya","Ali Seydi Keceli"],"pdf_url":"https://arxiv.org/pdf/2401.10643v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10640v1","updated":"2024-01-19T11:35:52Z","published":"2024-01-19T11:35:52Z","title":"A comprehensive study on fidelity metrics for XAI","summary":" The use of eXplainable Artificial Intelligence (XAI) systems has introduced a\nset of challenges that need resolution. Herein, we focus on how to correctly\nselect an XAI method, an open question within the field. The inherent\ndifficulty of this task is due to the lack of a ground truth. Several authors\nhave proposed metrics to approximate the fidelity of different XAI methods.\nThese metrics lack verification and have concerning disagreements. In this\nstudy, we proposed a novel methodology to verify fidelity metrics, using a\nwell-known transparent model, namely a decision tree. This model allowed us to\nobtain explanations with perfect fidelity. Our proposal constitutes the first\nobjective benchmark for these metrics, facilitating a comparison of existing\nproposals, and surpassing existing methods. We applied our benchmark to assess\nthe existing fidelity metrics in two different experiments, each using public\ndatasets comprising 52,000 images. The images from these datasets had a size of\n128 by 128 pixels and were synthetic data that simplified the training process.\nAll metric values indicated a lack of fidelity, with the best one showing a 30\n\\% deviation from the expected values for a perfect explanation. Our\nexperimentation led us to conclude that the current fidelity metrics are not\nreliable enough to be used in real scenarios. From this finding, we deemed it\nnecessary to develop new metrics to avoid the detected problems, and we\nrecommend the use of our proposal as a benchmark within the scientific\ncommunity to address these limitations.\n","authors":["Miquel Miró-Nicolau","Antoni Jaume-i-Capó","Gabriel Moyà-Alcover"],"pdf_url":"https://arxiv.org/pdf/2401.10640v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10637v1","updated":"2024-01-19T11:35:07Z","published":"2024-01-19T11:35:07Z","title":"Towards Universal Unsupervised Anomaly Detection in Medical Imaging","summary":" The increasing complexity of medical imaging data underscores the need for\nadvanced anomaly detection methods to automatically identify diverse\npathologies. Current methods face challenges in capturing the broad spectrum of\nanomalies, often limiting their use to specific lesion types in brain scans. To\naddress this challenge, we introduce a novel unsupervised approach, termed\n\\textit{Reversed Auto-Encoders (RA)}, designed to create realistic\npseudo-healthy reconstructions that enable the detection of a wider range of\npathologies. 
We evaluate the proposed method across various imaging modalities,\nincluding magnetic resonance imaging (MRI) of the brain, pediatric wrist X-ray,\nand chest X-ray, and demonstrate superior performance in detecting anomalies\ncompared to existing state-of-the-art methods. Our unsupervised anomaly\ndetection approach may enhance diagnostic accuracy in medical imaging by\nidentifying a broader range of unknown pathologies. Our code is publicly\navailable at: \\url{https://github.com/ci-ber/RA}.\n","authors":["Cosmin I. Bercea","Benedikt Wiestler","Daniel Rueckert","Julia A. Schnabel"],"pdf_url":"https://arxiv.org/pdf/2401.10637v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10620v1","updated":"2024-01-19T10:52:57Z","published":"2024-01-19T10:52:57Z","title":"Polytopic Autoencoders with Smooth Clustering for Reduced-order\n Modelling of Flows","summary":" With the advancement of neural networks, there has been a notable increase,\nboth in terms of quantity and variety, in research publications concerning the\napplication of autoencoders to reduced-order models. We propose a polytopic\nautoencoder architecture that includes a lightweight nonlinear encoder, a\nconvex combination decoder, and a smooth clustering network. Supported by\nseveral proofs, the model architecture ensures that all reconstructed states\nlie within a polytope, accompanied by a metric indicating the quality of the\nconstructed polytopes, referred to as polytope error. Additionally, it offers a\nminimal number of convex coordinates for polytopic linear-parameter varying\nsystems while achieving acceptable reconstruction errors compared to proper\northogonal decomposition (POD). To validate our proposed model, we conduct\nsimulations involving two flow scenarios with the incompressible Navier-Stokes\nequation. Numerical results demonstrate the guaranteed properties of the model,\nlow reconstruction errors compared to POD, and the improvement in error using a\nclustering network.\n","authors":["Jan Heiland","Yongho Kim"],"pdf_url":"https://arxiv.org/pdf/2401.10620v1.pdf","comment":"28 pages, 18 figures"},{"id":"http://arxiv.org/abs/2401.10608v1","updated":"2024-01-19T10:37:27Z","published":"2024-01-19T10:37:27Z","title":"M2ORT: Many-To-One Regression Transformer for Spatial Transcriptomics\n Prediction from Histopathology Images","summary":" The advancement of Spatial Transcriptomics (ST) has facilitated the\nspatially-aware profiling of gene expressions based on histopathology images.\nAlthough ST data offers valuable insights into the micro-environment of tumors,\nits acquisition cost remains expensive. Therefore, directly predicting the ST\nexpressions from digital pathology images is desired. Current methods usually\nadopt existing regression backbones for this task, which ignore the inherent\nmulti-scale hierarchical data structure of digital pathology images. To address\nthis limit, we propose M2ORT, a many-to-one regression Transformer that can\naccommodate the hierarchical structure of the pathology images through a\ndecoupled multi-scale feature extractor. Different from traditional models that\nare trained with one-to-one image-label pairs, M2ORT accepts multiple pathology\nimages of different magnifications at a time to jointly predict the gene\nexpressions at their corresponding common ST spot, aiming at learning a\nmany-to-one relationship through training. 
We have tested M2ORT on three public\nST datasets and the experimental results show that M2ORT can achieve\nstate-of-the-art performance with fewer parameters and floating-point\noperations (FLOPs). The code is available at:\nhttps://github.com/Dootmaan/M2ORT/.\n","authors":["Hongyi Wang","Xiuju Du","Jing Liu","Shuyi Ouyang","Yen-Wei Chen","Lanfen Lin"],"pdf_url":"https://arxiv.org/pdf/2401.10608v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10191v2","updated":"2024-01-19T10:01:36Z","published":"2024-01-18T18:25:29Z","title":"Divide and not forget: Ensemble of selectively trained experts in\n Continual Learning","summary":" Class-incremental learning is becoming more popular as it helps models widen\ntheir applicability while not forgetting what they already know. A trend in\nthis area is to use a mixture-of-expert technique, where different models work\ntogether to solve the task. However, the experts are usually trained all at\nonce using whole task data, which makes them all prone to forgetting and\nincreasing computational burden. To address this limitation, we introduce a\nnovel approach named SEED. SEED selects only one, the most optimal expert for a\nconsidered task, and uses data from this task to fine-tune only this expert.\nFor this purpose, each expert represents each class with a Gaussian\ndistribution, and the optimal expert is selected based on the similarity of\nthose distributions. Consequently, SEED increases diversity and heterogeneity\nwithin the experts while maintaining the high stability of this ensemble\nmethod. The extensive experiments demonstrate that SEED achieves\nstate-of-the-art performance in exemplar-free settings across various\nscenarios, showing the potential of expert diversification through data in\ncontinual learning.\n","authors":["Grzegorz Rypeść","Sebastian Cygert","Valeriya Khan","Tomasz Trzciński","Bartosz Zieliński","Bartłomiej Twardowski"],"pdf_url":"https://arxiv.org/pdf/2401.10191v2.pdf","comment":"Accepted for ICLR 2024 (main track), code is available at:\n https://github.com/grypesc/SEED"},{"id":"http://arxiv.org/abs/2401.10588v1","updated":"2024-01-19T09:58:06Z","published":"2024-01-19T09:58:06Z","title":"DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval","summary":" Text-video retrieval is a critical multi-modal task to find the most relevant\nvideo for a text query. Although pretrained models like CLIP have demonstrated\nimpressive potential in this area, the rising cost of fully finetuning these\nmodels due to increasing model size continues to pose a problem. To address\nthis challenge, prompt tuning has emerged as an alternative. However, existing\nworks still face two problems when adapting pretrained image-text models to\ndownstream video-text tasks: (1) The visual encoder could only encode\nframe-level features and failed to extract global-level general video\ninformation. (2) Equipping the visual and text encoder with separated prompts\nfailed to mitigate the visual-text modality gap. To this end, we propose DGL, a\ncross-modal Dynamic prompt tuning method with Global-Local video attention. In\ncontrast to previous prompt tuning methods, we employ the shared latent space\nto generate local-level text and frame prompts that encourage inter-modal\ninteraction. Furthermore, we propose modeling video in a global-local attention\nmechanism to capture global video information from the perspective of prompt\ntuning. 
Extensive experiments reveal that when only 0.67% parameters are tuned,\nour cross-modal prompt tuning strategy DGL outperforms or is comparable to\nfully finetuning methods on MSR-VTT, VATEX, LSMDC, and ActivityNet datasets.\nCode will be available at https://github.com/knightyxp/DGL\n","authors":["Xiangpeng Yang","Linchao Zhu","Xiaohan Wang","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2401.10588v1.pdf","comment":"AAAI2024, Code will be available at https://github.com/knightyxp/DGL"},{"id":"http://arxiv.org/abs/2303.06088v6","updated":"2024-01-19T09:45:02Z","published":"2023-03-10T17:09:04Z","title":"Towards domain-invariant Self-Supervised Learning with Batch Styles\n Standardization","summary":" In Self-Supervised Learning (SSL), models are typically pretrained,\nfine-tuned, and evaluated on the same domains. However, they tend to perform\npoorly when evaluated on unseen domains, a challenge that Unsupervised Domain\nGeneralization (UDG) seeks to address. Current UDG methods rely on domain\nlabels, which are often challenging to collect, and domain-specific\narchitectures that lack scalability when confronted with numerous domains,\nmaking the current methodology impractical and rigid. Inspired by\ncontrastive-based UDG methods that mitigate spurious correlations by\nrestricting comparisons to examples from the same domain, we hypothesize that\neliminating style variability within a batch could provide a more convenient\nand flexible way to reduce spurious correlations without requiring domain\nlabels. To verify this hypothesis, we introduce Batch Styles Standardization\n(BSS), a relatively simple yet powerful Fourier-based method to standardize the\nstyle of images in a batch specifically designed for integration with SSL\nmethods to tackle UDG. Combining BSS with existing SSL methods offers serious\nadvantages over prior UDG methods: (1) It eliminates the need for domain labels\nor domain-specific network components to enhance domain-invariance in SSL\nrepresentations, and (2) offers flexibility as BSS can be seamlessly integrated\nwith diverse contrastive-based but also non-contrastive-based SSL methods.\nExperiments on several UDG datasets demonstrate that it significantly improves\ndownstream task performances on unseen domains, often outperforming or rivaling\nwith UDG methods. Finally, this work clarifies the underlying mechanisms\ncontributing to BSS's effectiveness in improving domain-invariance in SSL\nrepresentations and performances on unseen domain.\n","authors":["Marin Scalbert","Maria Vakalopoulou","Florent Couzinié-Devy"],"pdf_url":"https://arxiv.org/pdf/2303.06088v6.pdf","comment":"Accepted at ICLR 2024"},{"id":"http://arxiv.org/abs/2401.10578v1","updated":"2024-01-19T09:41:09Z","published":"2024-01-19T09:41:09Z","title":"3D Shape Completion on Unseen Categories:A Weakly-supervised Approach","summary":" 3D shapes captured by scanning devices are often incomplete due to occlusion.\n3D shape completion methods have been explored to tackle this limitation.\nHowever, most of these methods are only trained and tested on a subset of\ncategories, resulting in poor generalization to unseen categories. In this\npaper, we introduce a novel weakly-supervised framework to reconstruct the\ncomplete shapes from unseen categories. We first propose an end-to-end\nprior-assisted shape learning network that leverages data from the seen\ncategories to infer a coarse shape. Specifically, we construct a prior bank\nconsisting of representative shapes from the seen categories. 
Then, we design a\nmulti-scale pattern correlation module for learning the complete shape of the\ninput by analyzing the correlation between local patterns within the input and\nthe priors at various scales. In addition, we propose a self-supervised shape\nrefinement model to further refine the coarse shape. Considering the shape\nvariability of 3D objects across categories, we construct a category-specific\nprior bank to facilitate shape refinement. Then, we devise a voxel-based\npartial matching loss and leverage the partial scans to drive the refinement\nprocess. Extensive experimental results show that our approach is superior to\nstate-of-the-art methods by a large margin.\n","authors":["Lintai Wu","Junhui Hou","Linqi Song","Yong Xu"],"pdf_url":"https://arxiv.org/pdf/2401.10578v1.pdf","comment":"13 pages,8 figures"},{"id":"http://arxiv.org/abs/2401.10564v1","updated":"2024-01-19T09:01:20Z","published":"2024-01-19T09:01:20Z","title":"Dream360: Diverse and Immersive Outdoor Virtual Scene Creation via\n Transformer-Based 360 Image Outpainting","summary":" 360 images, with a field-of-view (FoV) of 180x360, provide immersive and\nrealistic environments for emerging virtual reality (VR) applications, such as\nvirtual tourism, where users desire to create diverse panoramic scenes from a\nnarrow FoV photo they take from a viewpoint via portable devices. It thus\nbrings us to a technical challenge: `How to allow the users to freely create\ndiverse and immersive virtual scenes from a narrow FoV image with a specified\nviewport?' To this end, we propose a transformer-based 360 image outpainting\nframework called Dream360, which can generate diverse, high-fidelity, and\nhigh-resolution panoramas from user-selected viewports, considering the\nspherical properties of 360 images. Compared with existing methods, e.g., [3],\nwhich primarily focus on inputs with rectangular masks and central locations\nwhile overlooking the spherical property of 360 images, our Dream360 offers\nhigher outpainting flexibility and fidelity based on the spherical\nrepresentation. Dream360 comprises two key learning stages: (I) codebook-based\npanorama outpainting via Spherical-VQGAN (S-VQGAN), and (II) frequency-aware\nrefinement with a novel frequency-aware consistency loss. Specifically, S-VQGAN\nlearns a sphere-specific codebook from spherical harmonic (SH) values,\nproviding a better representation of spherical data distribution for scene\nmodeling. The frequency-aware refinement matches the resolution and further\nimproves the semantic consistency and visual fidelity of the generated results.\nOur Dream360 achieves significantly lower Frechet Inception Distance (FID)\nscores and better visual fidelity than existing methods. 
We also conducted a\nuser study involving 15 participants to interactively evaluate the quality of\nthe generated results in VR, demonstrating the flexibility and superiority of\nour Dream360 framework.\n","authors":["Hao Ai","Zidong Cao","Haonan Lu","Chen Chen","Jian Ma","Pengyuan Zhou","Tae-Kyun Kim","Pan Hui","Lin Wang"],"pdf_url":"https://arxiv.org/pdf/2401.10564v1.pdf","comment":"11 pages, accepted to IEEE VR 2024"},{"id":"http://arxiv.org/abs/2401.10561v1","updated":"2024-01-19T08:54:54Z","published":"2024-01-19T08:54:54Z","title":"MAEDiff: Masked Autoencoder-enhanced Diffusion Models for Unsupervised\n Anomaly Detection in Brain Images","summary":" Unsupervised anomaly detection has gained significant attention in the field\nof medical imaging due to its capability of relieving the costly pixel-level\nannotation. To achieve this, modern approaches usually utilize generative\nmodels to produce healthy references of the diseased images and then identify\nthe abnormalities by comparing the healthy references and the original diseased\nimages. Recently, diffusion models have exhibited promising potential for\nunsupervised anomaly detection in medical images for their good mode coverage\nand high sample quality. However, the intrinsic characteristics of the medical\nimages, e.g. the low contrast, and the intricate anatomical structure of the\nhuman body make the reconstruction challenging. Besides, the global information\nof medical images often remain underutilized. To address these two issues, we\npropose a novel Masked Autoencoder-enhanced Diffusion Model (MAEDiff) for\nunsupervised anomaly detection in brain images. The MAEDiff involves a\nhierarchical patch partition. It generates healthy images by overlapping\nupper-level patches and implements a mechanism based on the masked autoencoders\noperating on the sub-level patches to enhance the condition on the unnoised\nregions. Extensive experiments on data of tumors and multiple sclerosis lesions\ndemonstrate the effectiveness of our method.\n","authors":["Rui Xu","Yunke Wang","Bo Du"],"pdf_url":"https://arxiv.org/pdf/2401.10561v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10560v1","updated":"2024-01-19T08:52:24Z","published":"2024-01-19T08:52:24Z","title":"360ORB-SLAM: A Visual SLAM System for Panoramic Images with Depth\n Completion Network","summary":" To enhance the performance and effect of AR/VR applications and visual\nassistance and inspection systems, visual simultaneous localization and mapping\n(vSLAM) is a fundamental task in computer vision and robotics. However,\ntraditional vSLAM systems are limited by the camera's narrow field-of-view,\nresulting in challenges such as sparse feature distribution and lack of dense\ndepth information. To overcome these limitations, this paper proposes a\n360ORB-SLAM system for panoramic images that combines with a depth completion\nnetwork. The system extracts feature points from the panoramic image, utilizes\na panoramic triangulation module to generate sparse depth information, and\nemploys a depth completion network to obtain a dense panoramic depth map.\nExperimental results on our novel panoramic dataset constructed based on Carla\ndemonstrate that the proposed method achieves superior scale accuracy compared\nto existing monocular SLAM methods and effectively addresses the challenges of\nfeature association and scale ambiguity. 
The integration of the depth\ncompletion network enhances system stability and mitigates the impact of\ndynamic elements on SLAM performance.\n","authors":["Yichen Chen","Yiqi Pan","Ruyu Liu","Haoyu Zhang","Guodao Zhang","Bo Sun","Jianhua Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.10560v1.pdf","comment":"6 pages, 9 figures"},{"id":"http://arxiv.org/abs/2309.02119v3","updated":"2024-01-19T08:50:28Z","published":"2023-09-05T10:52:21Z","title":"Hierarchical Masked 3D Diffusion Model for Video Outpainting","summary":" Video outpainting aims to adequately complete missing areas at the edges of\nvideo frames. Compared to image outpainting, it presents an additional\nchallenge as the model should maintain the temporal consistency of the filled\narea. In this paper, we introduce a masked 3D diffusion model for video\noutpainting. We use the technique of mask modeling to train the 3D diffusion\nmodel. This allows us to use multiple guide frames to connect the results of\nmultiple video clip inferences, thus ensuring temporal consistency and reducing\njitter between adjacent frames. Meanwhile, we extract the global frames of the\nvideo as prompts and guide the model to obtain information other than the\ncurrent video clip using cross-attention. We also introduce a hybrid\ncoarse-to-fine inference pipeline to alleviate the artifact accumulation\nproblem. The existing coarse-to-fine pipeline only uses the infilling strategy,\nwhich brings degradation because the time interval of the sparse frames is too\nlarge. Our pipeline benefits from bidirectional learning of the mask modeling\nand thus can employ a hybrid strategy of infilling and interpolation when\ngenerating sparse frames. Experiments show that our method achieves\nstate-of-the-art results in video outpainting tasks. More results and codes are\nprovided at our https://fanfanda.github.io/M3DDM/.\n","authors":["Fanda Fan","Chaoxu Guo","Litong Gong","Biao Wang","Tiezheng Ge","Yuning Jiang","Chunjie Luo","Jianfeng Zhan"],"pdf_url":"https://arxiv.org/pdf/2309.02119v3.pdf","comment":"Accepted to ACM MM 2023"},{"id":"http://arxiv.org/abs/2401.10556v1","updated":"2024-01-19T08:44:52Z","published":"2024-01-19T08:44:52Z","title":"Symbol as Points: Panoptic Symbol Spotting via Point-based\n Representation","summary":" This work studies the problem of panoptic symbol spotting, which is to spot\nand parse both countable object instances (windows, doors, tables, etc.) and\nuncountable stuff (wall, railing, etc.) from computer-aided design (CAD)\ndrawings. Existing methods typically involve either rasterizing the vector\ngraphics into images and using image-based methods for symbol spotting, or\ndirectly building graphs and using graph neural networks for symbol\nrecognition. In this paper, we take a different approach, which treats graphic\nprimitives as a set of 2D points that are locally connected and use point cloud\nsegmentation methods to tackle it. Specifically, we utilize a point transformer\nto extract the primitive features and append a mask2former-like spotting head\nto predict the final output. To better use the local connection information of\nprimitives and enhance their discriminability, we further propose the attention\nwith connection module (ACM) and contrastive connection learning scheme (CCL).\nFinally, we propose a KNN interpolation mechanism for the mask attention module\nof the spotting head to better handle primitive mask downsampling, which is\nprimitive-level in contrast to pixel-level for the image. 
Our approach, named\nSymPoint, is simple yet effective, outperforming recent state-of-the-art method\nGAT-CADNet by an absolute increase of 9.6% PQ and 10.4% RQ on the FloorPlanCAD\ndataset. The source code and models will be available at\nhttps://github.com/nicehuster/SymPoint.\n","authors":["Wenlong Liu","Tianyu Yang","Yuhan Wang","Qizhi Yu","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.10556v1.pdf","comment":"ICLR 2024"},{"id":"http://arxiv.org/abs/2309.02773v2","updated":"2024-01-19T08:01:15Z","published":"2023-09-06T06:31:08Z","title":"Diffusion Model is Secretly a Training-free Open Vocabulary Semantic\n Segmenter","summary":" The pre-trained text-image discriminative models, such as CLIP, has been\nexplored for open-vocabulary semantic segmentation with unsatisfactory results\ndue to the loss of crucial localization information and awareness of object\nshapes. Recently, there has been a growing interest in expanding the\napplication of generative models from generation tasks to semantic\nsegmentation. These approaches utilize generative models either for generating\nannotated data or extracting features to facilitate semantic segmentation. This\ntypically involves generating a considerable amount of synthetic data or\nrequiring additional mask annotations. To this end, we uncover the potential of\ngenerative text-to-image diffusion models (e.g., Stable Diffusion) as highly\nefficient open-vocabulary semantic segmenters, and introduce a novel\ntraining-free approach named DiffSegmenter. The insight is that to generate\nrealistic objects that are semantically faithful to the input text, both the\ncomplete object shapes and the corresponding semantics are implicitly learned\nby diffusion models. We discover that the object shapes are characterized by\nthe self-attention maps while the semantics are indicated through the\ncross-attention maps produced by the denoising U-Net, forming the basis of our\nsegmentation results.Additionally, we carefully design effective textual\nprompts and a category filtering mechanism to further enhance the segmentation\nresults. Extensive experiments on three benchmark datasets show that the\nproposed DiffSegmenter achieves impressive results for open-vocabulary semantic\nsegmentation.\n","authors":["Jinglong Wang","Xiawei Li","Jing Zhang","Qingyuan Xu","Qin Zhou","Qian Yu","Lu Sheng","Dong Xu"],"pdf_url":"https://arxiv.org/pdf/2309.02773v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10541v1","updated":"2024-01-19T07:44:32Z","published":"2024-01-19T07:44:32Z","title":"I-SplitEE: Image classification in Split Computing DNNs with Early Exits","summary":" The recent advances in Deep Neural Networks (DNNs) stem from their\nexceptional performance across various domains. However, their inherent large\nsize hinders deploying these networks on resource-constrained devices like\nedge, mobile, and IoT platforms. Strategies have emerged, from partial cloud\ncomputation offloading (split computing) to integrating early exits within DNN\nlayers. Our work presents an innovative unified approach merging early exits\nand split computing. We determine the 'splitting layer', the optimal depth in\nthe DNN for edge device computations, and whether to infer on edge device or be\noffloaded to the cloud for inference considering accuracy, computational\nefficiency, and communication costs. Also, Image classification faces diverse\nenvironmental distortions, influenced by factors like time of day, lighting,\nand weather. 
To adapt to these distortions, we introduce I-SplitEE, an online\nunsupervised algorithm ideal for scenarios lacking ground truths and with\nsequential data. Experimental validation using Caltech-256 and Cifar-10\ndatasets subjected to varied distortions showcases I-SplitEE's ability to\nreduce costs by a minimum of 55% with marginal performance degradation of at\nmost 5%.\n","authors":["Divya Jyoti Bajpai","Aastha Jaiswal","Manjesh Kumar Hanawal"],"pdf_url":"https://arxiv.org/pdf/2401.10541v1.pdf","comment":"To appear in proceedings of IEEE International Conference on\n Communications 2024"},{"id":"http://arxiv.org/abs/2401.10537v1","updated":"2024-01-19T07:31:44Z","published":"2024-01-19T07:31:44Z","title":"Learning Position-Aware Implicit Neural Network for Real-World Face\n Inpainting","summary":" Face inpainting requires the model to have a precise global understanding of\nthe facial position structure. Benefiting from the powerful capabilities of\ndeep learning backbones, recent works in face inpainting have achieved decent\nperformance in ideal setting (square shape with $512px$). However, existing\nmethods often produce a visually unpleasant result, especially in the\nposition-sensitive details (e.g., eyes and nose), when directly applied to\narbitrary-shaped images in real-world scenarios. The visually unpleasant\nposition-sensitive details indicate the shortcomings of existing methods in\nterms of position information processing capability. In this paper, we propose\nan \\textbf{I}mplicit \\textbf{N}eural \\textbf{I}npainting \\textbf{N}etwork\n(IN$^2$) to handle arbitrary-shape face images in real-world scenarios by\nexplicit modeling for position information. Specifically, a downsample\nprocessing encoder is proposed to reduce information loss while obtaining the\nglobal semantic feature. A neighbor hybrid attention block is proposed with a\nhybrid attention mechanism to improve the facial understanding ability of the\nmodel without restricting the shape of the input. Finally, an implicit neural\npyramid decoder is introduced to explicitly model position information and\nbridge the gap between low-resolution features and high-resolution output.\nExtensive experiments demonstrate the superiority of the proposed method in\nreal-world face inpainting task.\n","authors":["Bo Zhao","Huan Yang","Jianlong Fu"],"pdf_url":"https://arxiv.org/pdf/2401.10537v1.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2312.16451v3","updated":"2024-01-19T07:27:18Z","published":"2023-12-27T07:35:17Z","title":"Domain Generalization with Vital Phase Augmentation","summary":" Deep neural networks have shown remarkable performance in image\nclassification. However, their performance significantly deteriorates with\ncorrupted input data. Domain generalization methods have been proposed to train\nrobust models against out-of-distribution data. Data augmentation in the\nfrequency domain is one of such approaches that enable a model to learn phase\nfeatures to establish domain-invariant representations. This approach changes\nthe amplitudes of the input data while preserving the phases. However, using\nfixed phases leads to susceptibility to phase fluctuations because amplitudes\nand phase fluctuations commonly occur in out-of-distribution. In this study, to\naddress this problem, we introduce an approach using finite variation of the\nphases of input data rather than maintaining fixed phases. 
Based on the\nassumption that the degree of domain-invariant features varies for each phase,\nwe propose a method to distinguish phases based on this degree. In addition, we\npropose a method called vital phase augmentation (VIPAug) that applies the\nvariation to the phases differently according to the degree of domain-invariant\nfeatures of given phases. The model depends more on the vital phases that\ncontain more domain-invariant features for attaining robustness to amplitude\nand phase fluctuations. We present experimental evaluations of our proposed\napproach, which exhibited improved performance for both clean and corrupted\ndata. VIPAug achieved SOTA performance on the benchmark CIFAR-10 and CIFAR-100\ndatasets, as well as near-SOTA performance on the ImageNet-100 and ImageNet\ndatasets. Our code is available at https://github.com/excitedkid/vipaug.\n","authors":["Ingyun Lee","Wooju Lee","Hyun Myung"],"pdf_url":"https://arxiv.org/pdf/2312.16451v3.pdf","comment":"Accepted by AAAI-24"},{"id":"http://arxiv.org/abs/2309.06023v4","updated":"2024-01-19T07:22:30Z","published":"2023-09-12T07:50:54Z","title":"Learning from History: Task-agnostic Model Contrastive Learning for\n Image Restoration","summary":" Contrastive learning has emerged as a prevailing paradigm for high-level\nvision tasks, which, by introducing properly negative samples, has also been\nexploited for low-level vision tasks to achieve a compact optimization space to\naccount for their ill-posed nature. However, existing methods rely on manually\npredefined and task-oriented negatives, which often exhibit pronounced\ntask-specific biases. To address this challenge, our paper introduces an\ninnovative method termed 'learning from history', which dynamically generates\nnegative samples from the target model itself. Our approach, named Model\nContrastive paradigm for Image Restoration (MCIR), rejuvenates latency models\nas negative models, making it compatible with diverse image restoration tasks.\nWe propose the Self-Prior guided Negative loss (SPN) to enable it. This\napproach significantly enhances existing models when retrained with the\nproposed model contrastive paradigm. The results show significant improvements\nin image restoration across various tasks and architectures. For example,\nmodels retrained with SPN outperform the original FFANet and DehazeFormer by\n3.41 dB and 0.57 dB on the RESIDE indoor dataset for image dehazing. Similarly,\nthey achieve notable improvements of 0.47 dB on SPA-Data over IDT for image\nderaining and 0.12 dB on Manga109 for a 4x scale super-resolution over\nlightweight SwinIR, respectively. Code and retrained models are available at\nhttps://github.com/Aitical/MCIR.\n","authors":["Gang Wu","Junjun Jiang","Kui Jiang","Xianming Liu"],"pdf_url":"https://arxiv.org/pdf/2309.06023v4.pdf","comment":"Camera Ready Version. Accepted to The 38th Annual AAAI Conference on\n Artificial Intelligence (AAAI 2024)"},{"id":"http://arxiv.org/abs/2401.10530v1","updated":"2024-01-19T07:12:36Z","published":"2024-01-19T07:12:36Z","title":"NWPU-MOC: A Benchmark for Fine-grained Multi-category Object Counting in\n Aerial Images","summary":" Object counting is a hot topic in computer vision, which aims to estimate the\nnumber of objects in a given image. However, most methods only count objects of\na single category for an image, which cannot be applied to scenes that need to\ncount objects with multiple categories simultaneously, especially in aerial\nscenes. 
To this end, this paper introduces a Multi-category Object Counting\n(MOC) task to estimate the numbers of different objects (cars, buildings,\nships, etc.) in an aerial image. Considering the absence of a dataset for this\ntask, a large-scale Dataset (NWPU-MOC) is collected, consisting of 3,416 scenes\nwith a resolution of 1024 $\\times$ 1024 pixels, and well-annotated using 14\nfine-grained object categories. Besides, each scene contains RGB and Near\nInfrared (NIR) images, of which the NIR spectrum can provide richer\ncharacterization information compared with only the RGB spectrum. Based on\nNWPU-MOC, the paper presents a multi-spectrum, multi-category object counting\nframework, which employs a dual-attention module to fuse the features of RGB\nand NIR and subsequently regress multi-channel density maps corresponding to\neach object category. In addition, to modeling the dependency between different\nchannels in the density map with each object category, a spatial contrast loss\nis designed as a penalty for overlapping predictions at the same spatial\nposition. Experimental results demonstrate that the proposed method achieves\nstate-of-the-art performance compared with some mainstream counting algorithms.\nThe dataset, code and models are publicly available at\nhttps://github.com/lyongo/NWPU-MOC.\n","authors":["Junyu Gao","Liangliang Zhao","Xuelong Li"],"pdf_url":"https://arxiv.org/pdf/2401.10530v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10529v1","updated":"2024-01-19T07:10:13Z","published":"2024-01-19T07:10:13Z","title":"Mementos: A Comprehensive Benchmark for Multimodal Large Language Model\n Reasoning over Image Sequences","summary":" Multimodal Large Language Models (MLLMs) have demonstrated proficiency in\nhandling a variety of visual-language tasks. However, current MLLM benchmarks\nare predominantly designed to evaluate reasoning based on static information\nabout a single image, and the ability of modern MLLMs to extrapolate from image\nsequences, which is essential for understanding our ever-changing world, has\nbeen less investigated. To address this challenge, this paper introduces\nMementos, a new benchmark designed to assess MLLMs' sequential image reasoning\nabilities. Mementos features 4,761 diverse image sequences with varying\nlengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning\nperformance. Through a careful evaluation of nine recent MLLMs on Mementos,\nincluding GPT-4V and Gemini, we find that they struggle to accurately describe\ndynamic information about given image sequences, often leading to\nhallucinations/misrepresentations of objects and their corresponding behaviors.\nOur quantitative analysis and case studies identify three key factors impacting\nMLLMs' sequential image reasoning: the correlation between object and\nbehavioral hallucinations, the influence of cooccurring behaviors, and the\ncompounding impact of behavioral hallucinations. 
Our dataset is available at\nhttps://github.com/umd-huang-lab/Mementos.\n","authors":["Xiyao Wang","Yuhang Zhou","Xiaoyu Liu","Hongjin Lu","Yuancheng Xu","Feihong He","Jaehong Yoon","Taixi Lu","Gedas Bertasius","Mohit Bansal","Huaxiu Yao","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2401.10529v1.pdf","comment":"27 pages, 23 figures"},{"id":"http://arxiv.org/abs/2401.10526v1","updated":"2024-01-19T07:06:58Z","published":"2024-01-19T07:06:58Z","title":"On mitigating stability-plasticity dilemma in CLIP-guided image morphing\n via geodesic distillation loss","summary":" Large-scale language-vision pre-training models, such as CLIP, have achieved\nremarkable text-guided image morphing results by leveraging several\nunconditional generative models. However, existing CLIP-guided image morphing\nmethods encounter difficulties when morphing photorealistic images.\nSpecifically, existing guidance fails to provide detailed explanations of the\nmorphing regions within the image, leading to misguidance. In this paper, we\nobserved that such misguidance could be effectively mitigated by simply using a\nproper regularization loss. Our approach comprises two key components: 1) a\ngeodesic cosine similarity loss that minimizes inter-modality features (i.e.,\nimage and text) on a projected subspace of CLIP space, and 2) a latent\nregularization loss that minimizes intra-modality features (i.e., image and\nimage) on the image manifold. By replacing the na\\\"ive directional CLIP loss in\na drop-in replacement manner, our method achieves superior morphing results on\nboth images and videos for various benchmarks, including CLIP-inversion.\n","authors":["Yeongtak Oh","Saehyung Lee","Uiwon Hwang","Sungroh Yoon"],"pdf_url":"https://arxiv.org/pdf/2401.10526v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.07567v2","updated":"2024-01-19T07:04:56Z","published":"2024-01-15T09:59:43Z","title":"Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy\n for Temporal Sentence Grounding in Video","summary":" Temporal Sentence Grounding in Video (TSGV) is troubled by dataset bias\nissue, which is caused by the uneven temporal distribution of the target\nmoments for samples with similar semantic components in input videos or query\ntexts. Existing methods resort to utilizing prior knowledge about bias to\nartificially break this uneven distribution, which only removes a limited\namount of significant language biases. In this work, we propose the\nbias-conflict sample synthesis and adversarial removal debias strategy\n(BSSARD), which dynamically generates bias-conflict samples by explicitly\nleveraging potentially spurious correlations between single-modality features\nand the temporal position of the target moments. Through adversarial training,\nits bias generators continuously introduce biases and generate bias-conflict\nsamples to deceive its grounding model. Meanwhile, the grounding model\ncontinuously eliminates the introduced biases, which requires it to model\nmulti-modality alignment information. BSSARD will cover most kinds of coupling\nrelationships and disrupt language and visual biases simultaneously. Extensive\nexperiments on Charades-CD and ActivityNet-CD demonstrate the promising\ndebiasing capability of BSSARD. 
Source codes are available at\nhttps://github.com/qzhb/BSSARD.\n","authors":["Zhaobo Qi","Yibo Yuan","Xiaowen Ruan","Shuhui Wang","Weigang Zhang","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2401.07567v2.pdf","comment":"accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2401.10525v1","updated":"2024-01-19T07:01:07Z","published":"2024-01-19T07:01:07Z","title":"Focaler-IoU: More Focused Intersection over Union Loss","summary":" Bounding box regression plays a crucial role in the field of object\ndetection, and the positioning accuracy of object detection largely depends on\nthe loss function of bounding box regression. Existing research improves\nregression performance by utilizing the geometric relationship between bounding\nboxes, while ignoring the impact of difficult and easy sample distribution on\nbounding box regression. In this article, we analyzed the impact of difficult\nand easy sample distribution on regression results, and then proposed\nFocaler-IoU, which can improve detector performance in different detection\ntasks by focusing on different regression samples. Finally, comparative\nexperiments were conducted using existing advanced detectors and regression\nmethods for different detection tasks, and the detection performance was\nfurther improved by using the method proposed in this paper. Code is available\nat \\url{https://github.com/malagoutou/Focaler-IoU}.\n","authors":["Hao Zhang","Shuaijie Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.10525v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2312.17663"},{"id":"http://arxiv.org/abs/2401.10512v1","updated":"2024-01-19T06:04:48Z","published":"2024-01-19T06:04:48Z","title":"Exploring Color Invariance through Image-Level Ensemble Learning","summary":" In the field of computer vision, the persistent presence of color bias,\nresulting from fluctuations in real-world lighting and camera conditions,\npresents a substantial challenge to the robustness of models. This issue is\nparticularly pronounced in complex wide-area surveillance scenarios, such as\nperson re-identification and industrial dust segmentation, where models often\nexperience a decline in performance due to overfitting on color information\nduring training, given the presence of environmental variations. Consequently,\nthere is a need to effectively adapt models to cope with the complexities of\ncamera conditions. To address this challenge, this study introduces a learning\nstrategy named Random Color Erasing, which draws inspiration from ensemble\nlearning. This strategy selectively erases partial or complete color\ninformation in the training data without disrupting the original image\nstructure, thereby achieving a balanced weighting of color features and other\nfeatures within the neural network. This approach mitigates the risk of\noverfitting and enhances the model's ability to handle color variation, thereby\nimproving its overall robustness. The approach we propose serves as an ensemble\nlearning strategy, characterized by robust interpretability. A comprehensive\nanalysis of this methodology is presented in this paper. Across various tasks\nsuch as person re-identification and semantic segmentation, our approach\nconsistently improves strong baseline methods. Notably, in comparison to\nexisting methods that prioritize color robustness, our strategy significantly\nenhances performance in cross-domain scenarios. 
The code available at\n\\url{https://github.com/layumi/Person\\_reID\\_baseline\\_pytorch/blob/master/random\\_erasing.py}\nor \\url{https://github.com/finger-monkey/Data-Augmentation}.\n","authors":["Yunpeng Gong","Jiaquan Li","Lifei Chen","Min Jiang"],"pdf_url":"https://arxiv.org/pdf/2401.10512v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10511v1","updated":"2024-01-19T06:03:01Z","published":"2024-01-19T06:03:01Z","title":"GMC-IQA: Exploiting Global-correlation and Mean-opinion Consistency for\n No-reference Image Quality Assessment","summary":" Due to the subjective nature of image quality assessment (IQA), assessing\nwhich image has better quality among a sequence of images is more reliable than\nassigning an absolute mean opinion score for an image. Thus, IQA models are\nevaluated by global correlation consistency (GCC) metrics like PLCC and SROCC,\nrather than mean opinion consistency (MOC) metrics like MAE and MSE. However,\nmost existing methods adopt MOC metrics to define their loss functions, due to\nthe infeasible computation of GCC metrics during training. In this work, we\nconstruct a novel loss function and network to exploit Global-correlation and\nMean-opinion Consistency, forming a GMC-IQA framework. Specifically, we propose\na novel GCC loss by defining a pairwise preference-based rank estimation to\nsolve the non-differentiable problem of SROCC and introducing a queue mechanism\nto reserve previous data to approximate the global results of the whole data.\nMoreover, we propose a mean-opinion network, which integrates diverse opinion\nfeatures to alleviate the randomness of weight learning and enhance the model\nrobustness. Experiments indicate that our method outperforms SOTA methods on\nmultiple authentic datasets with higher accuracy and generalization. We also\nadapt the proposed loss to various networks, which brings better performance\nand more stable training.\n","authors":["Zewen Chen","Juan Wang","Bing Li","Chunfeng Yuan","Weiming Hu","Junxian Liu","Peng Li","Yan Wang","Youqun Zhang","Congxuan Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.10511v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.05594v3","updated":"2024-01-19T05:50:58Z","published":"2024-01-10T23:55:16Z","title":"Wasserstein Distance-based Expansion of Low-Density Latent Regions for\n Unknown Class Detection","summary":" This paper addresses the significant challenge in open-set object detection\n(OSOD): the tendency of state-of-the-art detectors to erroneously classify\nunknown objects as known categories with high confidence. We present a novel\napproach that effectively identifies unknown objects by distinguishing between\nhigh and low-density regions in latent space. Our method builds upon the\nOpen-Det (OD) framework, introducing two new elements to the loss function.\nThese elements enhance the known embedding space's clustering and expand the\nunknown space's low-density regions. The first addition is the Class\nWasserstein Anchor (CWA), a new function that refines the classification\nboundaries. The second is a spectral normalisation step, improving the\nrobustness of the model. Together, these augmentations to the existing\nContrastive Feature Learner (CFL) and Unknown Probability Learner (UPL) loss\nfunctions significantly improve OSOD performance. 
Our proposed OpenDet-CWA\n(OD-CWA) method demonstrates: a) a reduction in open-set errors by\napproximately 17%-22%, b) an enhancement in novelty detection capability by\n1.5%-16%, and c) a decrease in the wilderness index by 2%-20% across various\nopen-set scenarios. These results represent a substantial advancement in the\nfield, showcasing the potential of our approach in managing the complexities of\nopen-set object detection.\n","authors":["Prakash Mallick","Feras Dayoub","Jamie Sherrah"],"pdf_url":"https://arxiv.org/pdf/2401.05594v3.pdf","comment":"8 Full length pages, followed by 2 supplementary pages, total of 9\n Figures"},{"id":"http://arxiv.org/abs/2208.09424v3","updated":"2024-01-19T05:32:54Z","published":"2022-08-19T16:16:59Z","title":"Hierarchical Compositional Representations for Few-shot Action\n Recognition","summary":" Recently action recognition has received more and more attention for its\ncomprehensive and practical applications in intelligent surveillance and\nhuman-computer interaction. However, few-shot action recognition has not been\nwell explored and remains challenging because of data scarcity. In this paper,\nwe propose a novel hierarchical compositional representations (HCR) learning\napproach for few-shot action recognition. Specifically, we divide a complicated\naction into several sub-actions by carefully designed hierarchical clustering\nand further decompose the sub-actions into more fine-grained spatially\nattentional sub-actions (SAS-actions). Although there exist large differences\nbetween base classes and novel classes, they can share similar patterns in\nsub-actions or SAS-actions. Furthermore, we adopt the Earth Mover's Distance in\nthe transportation problem to measure the similarity between video samples in\nterms of sub-action representations. It computes the optimal matching flows\nbetween sub-actions as distance metric, which is favorable for comparing\nfine-grained patterns. Extensive experiments show our method achieves the\nstate-of-the-art results on HMDB51, UCF101 and Kinetics datasets.\n","authors":["Changzhen Li","Jie Zhang","Shuzhe Wu","Xin Jin","Shiguang Shan"],"pdf_url":"https://arxiv.org/pdf/2208.09424v3.pdf","comment":"Accepted by Computer Vision and Image Understanding"},{"id":"http://arxiv.org/abs/2401.10501v1","updated":"2024-01-19T05:28:51Z","published":"2024-01-19T05:28:51Z","title":"Enhancing medical vision-language contrastive learning via\n inter-matching relation modelling","summary":" Medical image representations can be learned through medical vision-language\ncontrastive learning (mVLCL) where medical imaging reports are used as weak\nsupervision through image-text alignment. These learned image representations\ncan be transferred to and benefit various downstream medical vision tasks such\nas disease classification and segmentation. Recent mVLCL methods attempt to\nalign image sub-regions and the report keywords as local-matchings. However,\nthese methods aggregate all local-matchings via simple pooling operations while\nignoring the inherent relations between them. These methods therefore fail to\nreason between local-matchings that are semantically related, e.g.,\nlocal-matchings that correspond to the disease word and the location word\n(semantic-relations), and also fail to differentiate such clinically important\nlocal-matchings from others that correspond to less meaningful words, e.g.,\nconjunction words (importance-relations). 
Hence, we propose a mVLCL method that\nmodels the inter-matching relations between local-matchings via a\nrelation-enhanced contrastive learning framework (RECLF). In RECLF, we\nintroduce a semantic-relation reasoning module (SRM) and an importance-relation\nreasoning module (IRM) to enable more fine-grained report supervision for image\nrepresentation learning. We evaluated our method using four public benchmark\ndatasets on four downstream tasks, including segmentation, zero-shot\nclassification, supervised classification, and cross-modal retrieval. Our\nresults demonstrated the superiority of our RECLF over the state-of-the-art\nmVLCL methods with consistent improvements across single-modal and cross-modal\ntasks. These results suggest that our RECLF, by modelling the inter-matching\nrelations, can learn improved medical image representations with better\ngeneralization capabilities.\n","authors":["Mingjian Li","Mingyuan Meng","Michael Fulham","David Dagan Feng","Lei Bi","Jinman Kim"],"pdf_url":"https://arxiv.org/pdf/2401.10501v1.pdf","comment":"11 pages, 5 figures. Under review"},{"id":"http://arxiv.org/abs/2401.09895v2","updated":"2024-01-19T05:27:15Z","published":"2024-01-18T11:14:32Z","title":"Skeleton-Guided Instance Separation for Fine-Grained Segmentation in\n Microscopy","summary":" One of the fundamental challenges in microscopy (MS) image analysis is\ninstance segmentation (IS), particularly when segmenting cluster regions where\nmultiple objects of varying sizes and shapes may be connected or even\noverlapped in arbitrary orientations. Existing IS methods usually fail in\nhandling such scenarios, as they rely on coarse instance representations such\nas keypoints and horizontal bounding boxes (h-bboxes). In this paper, we\npropose a novel one-stage framework named A2B-IS to address this challenge and\nenhance the accuracy of IS in MS images. Our approach represents each instance\nwith a pixel-level mask map and a rotated bounding box (r-bbox). Unlike\ntwo-stage methods that use box proposals for segmentations, our method\ndecouples mask and box predictions, enabling simultaneous processing to\nstreamline the model pipeline. Additionally, we introduce a Gaussian skeleton\nmap to aid the IS task in two key ways: (1) It guides anchor placement,\nreducing computational costs while improving the model's capacity to learn\nRoI-aware features by filtering out noise from background regions. (2) It\nensures accurate isolation of densely packed instances by rectifying erroneous\nbox predictions near instance boundaries. To further enhance the performance,\nwe integrate two modules into the framework: (1) An Atrous Attention Block\n(A2B) designed to extract high-resolution feature maps with fine-grained\nmultiscale information, and (2) A Semi-Supervised Learning (SSL) strategy that\nleverages both labeled and unlabeled images for model training. 
Our method has\nbeen thoroughly validated on two large-scale MS datasets, demonstrating its\nsuperiority over most state-of-the-art approaches.\n","authors":["Jun Wang","Chengfeng Zhou","Zhaoyan Ming","Lina Wei","Xudong Jiang","Dahong Qian"],"pdf_url":"https://arxiv.org/pdf/2401.09895v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12653v2","updated":"2024-01-19T04:37:18Z","published":"2023-12-19T22:53:32Z","title":"Diagnosis Of Takotsubo Syndrome By Robust Feature Selection From The\n Complex Latent Space Of DL-based Segmentation Network","summary":" Researchers have shown significant correlations among segmented objects in\nvarious medical imaging modalities and disease-related pathologies. Several\nstudies showed that using hand-crafted features for disease prediction neglects\nthe immense possibility of using latent features from deep learning (DL) models,\nwhich may reduce the overall accuracy of differential diagnosis. However,\ndirectly using classification or segmentation models on medical images to learn latent\nfeatures forgoes robust feature selection and may lead to overfitting. To fill\nthis gap, we propose a novel feature selection technique using the latent space\nof a segmentation model that can aid diagnosis. We evaluated our method in\ndifferentiating a rare cardiac disease, Takotsubo Syndrome (TTS), from ST\nelevation myocardial infarction (STEMI) using echocardiogram videos (echo). TTS\ncan mimic clinical features of STEMI in echo and is extremely hard to distinguish.\nOur approach shows promising results in differential diagnosis of TTS with 82%\ndiagnostic accuracy, beating the previous state-of-the-art (SOTA) approach.\nMoreover, the robust feature selection technique using the LASSO algorithm shows\ngreat potential in reducing redundant features and creates a robust\npipeline for short- and long-term disease prognoses in the downstream analysis.\n","authors":["Fahim Ahmed Zaman","Wahidul Alam","Tarun Kanti Roy","Amanda Chang","Kan Liu","Xiaodong Wu"],"pdf_url":"https://arxiv.org/pdf/2312.12653v2.pdf","comment":"5 pages, 3 figures, conference"},{"id":"http://arxiv.org/abs/2401.10150v2","updated":"2024-01-19T04:27:05Z","published":"2024-01-18T17:22:37Z","title":"Motion-Zero: Zero-Shot Moving Object Control Framework for\n Diffusion-Based Video Generation","summary":" Recent large-scale pre-trained diffusion models have demonstrated a powerful\ngenerative ability to produce high-quality videos from detailed text\ndescriptions. However, exerting control over the motion of objects in videos\ngenerated by any video diffusion model is a challenging problem. In this paper,\nwe propose a novel zero-shot moving object trajectory control framework,\nMotion-Zero, to enable a bounding-box-trajectories-controlled text-to-video\ndiffusion model. To this end, an initial noise prior module is designed to\nprovide a position-based prior to improve the stability of the appearance of\nthe moving object and the accuracy of position. In addition, based on the\nattention map of the U-net, spatial constraints are directly applied to the\ndenoising process of diffusion models, which further ensures the positional and\nspatial consistency of moving objects during inference. Furthermore,\ntemporal consistency is guaranteed with a proposed shift temporal attention\nmechanism. Our method can be flexibly applied to various state-of-the-art video\ndiffusion models without any training process. 
Extensive experiments\ndemonstrate our proposed method can control the motion trajectories of objects\nand generate high-quality videos.\n","authors":["Changgu Chen","Junwei Shu","Lianggangxu Chen","Gaoqi He","Changbo Wang","Yang Li"],"pdf_url":"https://arxiv.org/pdf/2401.10150v2.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2401.09721v2","updated":"2024-01-19T04:07:33Z","published":"2024-01-18T04:51:41Z","title":"Fast graph-based denoising for point cloud color information","summary":" Point clouds are utilized in various 3D applications such as cross-reality\n(XR) and realistic 3D displays. In some applications, e.g., for live streaming\nusing a 3D point cloud, real-time point cloud denoising methods are required to\nenhance the visual quality. However, conventional high-precision denoising\nmethods cannot be executed in real time for large-scale point clouds owing to\nthe complexity of graph constructions with K nearest neighbors and noise level\nestimation. This paper proposes a fast graph-based denoising (FGBD) for a\nlarge-scale point cloud. First, high-speed graph construction is achieved by\nscanning a point cloud in various directions and searching adjacent\nneighborhoods on the scanning lines. Second, we propose a fast noise level\nestimation method using eigenvalues of the covariance matrix on a graph.\nFinally, we also propose a new low-cost filter selection method to enhance\ndenoising accuracy to compensate for the degradation caused by the acceleration\nalgorithms. In our experiments, we succeeded in reducing the processing time\ndramatically while maintaining accuracy relative to conventional denoising\nmethods. Denoising was performed at 30fps, with frames containing approximately\n1 million points.\n","authors":["Ryosuke Watanabe","Keisuke Nonaka","Eduardo Pavez","Tatsuya Kobayashi","Antonio Ortega"],"pdf_url":"https://arxiv.org/pdf/2401.09721v2.pdf","comment":"Published in the proceeding of 2024 IEEE International Conference on\n Acoustics, Speech and Signal Processing (ICASSP 2024)"},{"id":"http://arxiv.org/abs/2401.10475v1","updated":"2024-01-19T03:54:58Z","published":"2024-01-19T03:54:58Z","title":"CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short\n Video Search Scenarios","summary":" Vision-Language Models pre-trained on large-scale image-text datasets have\nshown superior performance in downstream tasks such as image retrieval. Most of\nthe images for pre-training are presented in the form of open domain\ncommon-sense visual elements. Differently, video covers in short video search\nscenarios are presented as user-originated contents that provide important\nvisual summaries of videos. In addition, a portion of the video covers come\nwith manually designed cover texts that provide semantic complements. In order\nto fill in the gaps in short video cover data, we establish the first\nlarge-scale cover-text benchmark for Chinese short video search scenarios.\nSpecifically, we release two large-scale datasets CBVS-5M/10M to provide short\nvideo covers, and the manual fine-labeling dataset CBVS-20K to provide real\nuser queries, which serves as an image-text benchmark test in the Chinese short\nvideo search field. To integrate the semantics of cover text in the case of\nmodality missing, we propose UniCLIP where cover texts play a guiding role\nduring training, however are not relied upon by inference. Extensive evaluation\non CBVS-20K demonstrates the excellent performance of our proposal. 
UniCLIP has\nbeen deployed to Tencent's online video search systems with hundreds of\nmillions of visits and achieved significant gains. The complete dataset, code\nand checkpoints will be available upon release.\n","authors":["Xiangshuo Qiao","Xianxin Li","Xiaozhe Qu","Jie Zhang","Yang Liu","Yu Luo","Cihang Jin","Jin Ma"],"pdf_url":"https://arxiv.org/pdf/2401.10475v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10474v1","updated":"2024-01-19T03:50:19Z","published":"2024-01-19T03:50:19Z","title":"LDReg: Local Dimensionality Regularized Self-Supervised Learning","summary":" Representations learned via self-supervised learning (SSL) can be susceptible\nto dimensional collapse, where the learned representation subspace is of\nextremely low dimensionality and thus fails to represent the full data\ndistribution and modalities. Dimensional collapse also known as the\n\"underfilling\" phenomenon is one of the major causes of degraded performance on\ndownstream tasks. Previous work has investigated the dimensional collapse\nproblem of SSL at a global level. In this paper, we demonstrate that\nrepresentations can span over high dimensional space globally, but collapse\nlocally. To address this, we propose a method called $\\textit{local\ndimensionality regularization (LDReg)}$. Our formulation is based on the\nderivation of the Fisher-Rao metric to compare and optimize local distance\ndistributions at an asymptotically small radius for each data point. By\nincreasing the local intrinsic dimensionality, we demonstrate through a range\nof experiments that LDReg improves the representation quality of SSL. The\nresults also show that LDReg can regularize dimensionality at both local and\nglobal levels.\n","authors":["Hanxun Huang","Ricardo J. G. B. Campello","Sarah Monazam Erfani","Xingjun Ma","Michael E. Houle","James Bailey"],"pdf_url":"https://arxiv.org/pdf/2401.10474v1.pdf","comment":"ICLR 2024"},{"id":"http://arxiv.org/abs/2309.09466v2","updated":"2024-01-19T03:37:57Z","published":"2023-09-18T04:01:25Z","title":"Progressive Text-to-Image Diffusion with Soft Latent Direction","summary":" In spite of the rapidly evolving landscape of text-to-image generation, the\nsynthesis and manipulation of multiple entities while adhering to specific\nrelational constraints pose enduring challenges. This paper introduces an\ninnovative progressive synthesis and editing operation that systematically\nincorporates entities into the target image, ensuring their adherence to\nspatial and relational constraints at each sequential step. Our key insight\nstems from the observation that while a pre-trained text-to-image diffusion\nmodel adeptly handles one or two entities, it often falters when dealing with a\ngreater number. To address this limitation, we propose harnessing the\ncapabilities of a Large Language Model (LLM) to decompose intricate and\nprotracted text descriptions into coherent directives adhering to stringent\nformats. To facilitate the execution of directives involving distinct semantic\noperations-namely insertion, editing, and erasing-we formulate the Stimulus,\nResponse, and Fusion (SRF) framework. Within this framework, latent regions are\ngently stimulated in alignment with each operation, followed by the fusion of\nthe responsive latent components to achieve cohesive entity manipulation. 
Our\nproposed framework yields notable advancements in object synthesis,\nparticularly when confronted with intricate and lengthy textual inputs.\nConsequently, it establishes a new benchmark for text-to-image generation\ntasks, further elevating the field's performance standards.\n","authors":["YuTeng Ye","Jiale Cai","Hang Zhou","Guanwen Li","Youjia Zhang","Zikai Song","Chenxing Gao","Junqing Yu","Wei Yang"],"pdf_url":"https://arxiv.org/pdf/2309.09466v2.pdf","comment":"14 pages, 15 figures"},{"id":"http://arxiv.org/abs/2401.10090v2","updated":"2024-01-19T03:31:49Z","published":"2024-01-18T15:56:23Z","title":"Cross-Modality Perturbation Synergy Attack for Person Re-identification","summary":" In recent years, there has been significant research focusing on addressing\nsecurity concerns in single-modal person re-identification (ReID) systems that\nare based on RGB images. However, the safety of cross-modality scenarios, which\nare more commonly encountered in practical applications involving images\ncaptured by infrared cameras, has not received adequate attention. The main\nchallenge in cross-modality ReID lies in effectively dealing with visual\ndifferences between different modalities. For instance, infrared images are\ntypically grayscale, unlike visible images that contain color information.\nExisting attack methods have primarily focused on the characteristics of the\nvisible image modality, overlooking the features of other modalities and the\nvariations in data distribution among different modalities. This oversight can\npotentially undermine the effectiveness of these methods in image retrieval\nacross diverse modalities. This study represents the first exploration into the\nsecurity of cross-modality ReID models and proposes a universal perturbation\nattack specifically designed for cross-modality ReID. This attack optimizes\nperturbations by leveraging gradients from diverse modality data, thereby\ndisrupting the discriminator and reinforcing the differences between\nmodalities. We conducted experiments on two widely used cross-modality\ndatasets, namely RegDB and SYSU, which not only demonstrated the effectiveness\nof our method but also provided insights for future enhancements in the\nrobustness of cross-modality ReID systems.\n","authors":["Yunpeng Gong","Zhun Zhong","Zhiming Luo","Yansong Qu","Rongrong Ji","Min Jiang"],"pdf_url":"https://arxiv.org/pdf/2401.10090v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10461v1","updated":"2024-01-19T03:01:07Z","published":"2024-01-19T03:01:07Z","title":"Learning to Robustly Reconstruct Low-light Dynamic Scenes from Spike\n Streams","summary":" As a neuromorphic sensor with high temporal resolution, spike camera can\ngenerate continuous binary spike streams to capture per-pixel light intensity.\nWe can use reconstruction methods to restore scene details in high-speed\nscenarios. However, due to limited information in spike streams, low-light\nscenes are difficult to effectively reconstruct. In this paper, we propose a\nbidirectional recurrent-based reconstruction framework, including a\nLight-Robust Representation (LR-Rep) and a fusion module, to better handle such\nextreme conditions. LR-Rep is designed to aggregate temporal information in\nspike streams, and a fusion module is utilized to extract temporal features.\nAdditionally, we have developed a reconstruction benchmark for high-speed\nlow-light scenes. Light sources in the scenes are carefully aligned to\nreal-world conditions. 
Experimental results demonstrate the superiority of our\nmethod, which also generalizes well to real spike streams. Related codes and\nproposed datasets will be released after publication.\n","authors":["Liwen Hu","Ziluo Ding","Mianzhi Liu","Lei Ma","Tiejun Huang"],"pdf_url":"https://arxiv.org/pdf/2401.10461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14197v2","updated":"2024-01-19T02:46:00Z","published":"2023-10-22T06:16:16Z","title":"Diffusion-based Data Augmentation for Nuclei Image Segmentation","summary":" Nuclei segmentation is a fundamental but challenging task in the quantitative\nanalysis of histopathology images. Although fully-supervised deep\nlearning-based methods have made significant progress, a large number of\nlabeled images are required to achieve great segmentation performance.\nConsidering that manually labeling all nuclei instances for a dataset is\ninefficient, obtaining a large-scale human-annotated dataset is time-consuming\nand labor-intensive. Therefore, augmenting a dataset with only a few labeled\nimages to improve the segmentation performance is of significant research and\napplication value. In this paper, we introduce the first diffusion-based\naugmentation method for nuclei segmentation. The idea is to synthesize a large\nnumber of labeled images to facilitate training the segmentation model. To\nachieve this, we propose a two-step strategy. In the first step, we train an\nunconditional diffusion model to synthesize the Nuclei Structure that is\ndefined as the representation of pixel-level semantic and distance transform.\nEach synthetic nuclei structure will serve as a constraint on histopathology\nimage synthesis and is further post-processed to be an instance map. In the\nsecond step, we train a conditioned diffusion model to synthesize\nhistopathology images based on nuclei structures. The synthetic histopathology\nimages paired with synthetic instance maps will be added to the real dataset\nfor training the segmentation model. The experimental results show that by\naugmenting 10% labeled real dataset with synthetic samples, one can achieve\ncomparable segmentation results with the fully-supervised baseline. The code is\nreleased in: https://github.com/lhaof/Nudiff\n","authors":["Xinyi Yu","Guanbin Li","Wei Lou","Siqi Liu","Xiang Wan","Yan Chen","Haofeng Li"],"pdf_url":"https://arxiv.org/pdf/2310.14197v2.pdf","comment":"MICCAI 2023, released code: https://github.com/lhaof/Nudiff"},{"id":"http://arxiv.org/abs/2311.15497v3","updated":"2024-01-19T02:45:44Z","published":"2023-11-27T02:48:06Z","title":"Adaptive Image Registration: A Hybrid Approach Integrating Deep Learning\n and Optimization Functions for Enhanced Precision","summary":" Image registration has traditionally been done using two distinct approaches:\nlearning based methods, relying on robust deep neural networks, and\noptimization-based methods, applying complex mathematical transformations to\nwarp images accordingly. Of course, both paradigms offer advantages and\ndisadvantages, and, in this work, we seek to combine their respective strengths\ninto a single streamlined framework, using the outputs of the learning based\nmethod as initial parameters for optimization while prioritizing computational\npower for the image pairs that offer the greatest loss. 
Our investigations\nshowed improvements of up to 1.6% in test data, while maintaining the same\ninference time, and a substantial 1.0% points performance gain in deformation\nfield smoothness.\n","authors":["Gabriel De Araujo","Shanlin Sun","Xiaohui Xie"],"pdf_url":"https://arxiv.org/pdf/2311.15497v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.06551v4","updated":"2024-01-19T02:42:20Z","published":"2022-08-13T02:50:35Z","title":"Exploiting Multiple Sequence Lengths in Fast End to End Training for\n Image Captioning","summary":" We introduce a method called the Expansion mechanism that processes the input\nunconstrained by the number of elements in the sequence. By doing so, the model\ncan learn more effectively compared to traditional attention-based approaches.\nTo support this claim, we design a novel architecture ExpansionNet v2 that\nachieved strong results on the MS COCO 2014 Image Captioning challenge and the\nState of the Art in its respective category, with a score of 143.7 CIDErD in\nthe offline test split, 140.8 CIDErD in the online evaluation server and 72.9\nAllCIDEr on the nocaps validation set. Additionally, we introduce an End to End\ntraining algorithm up to 2.8 times faster than established alternatives. Source\ncode available at: https://github.com/jchenghu/ExpansionNet_v2\n","authors":["Jia Cheng Hu","Roberto Cavicchioli","Alessandro Capotondi"],"pdf_url":"https://arxiv.org/pdf/2208.06551v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10110v2","updated":"2024-01-19T02:31:02Z","published":"2024-01-18T16:27:09Z","title":"VIPTR: A Vision Permutable Extractor for Fast and Efficient Scene Text\n Recognition","summary":" Scene Text Recognition (STR) is a challenging task that involves recognizing\ntext within images of natural scenes. Although current state-of-the-art models\nfor STR exhibit high performance, they typically suffer from low inference\nefficiency due to their reliance on hybrid architectures comprised of visual\nencoders and sequence decoders. In this work, we propose the VIsion Permutable\nextractor for fast and efficient scene Text Recognition (VIPTR), which achieves\nan impressive balance between high performance and rapid inference speeds in\nthe domain of STR. Specifically, VIPTR leverages a visual-semantic extractor\nwith a pyramid structure, characterized by multiple self-attention layers,\nwhile eschewing the traditional sequence decoder. This design choice results in\na lightweight and efficient model capable of handling inputs of varying sizes.\nExtensive experimental results on various standard datasets for both Chinese\nand English scene text recognition validate the superiority of VIPTR. Notably,\nthe VIPTR-T (Tiny) variant delivers highly competitive accuracy on par with\nother lightweight models and achieves SOTA inference speeds. Meanwhile, the\nVIPTR-L (Large) variant attains greater recognition accuracy, while maintaining\na low parameter count and favorable inference speed. Our proposed method\nprovides a compelling solution for the STR challenge, which blends high\naccuracy with efficiency and greatly benefits real-world applications requiring\nfast and reliable text recognition. 
The code is publicly available at\nhttps://github.com/cxfyxl/VIPTR.\n","authors":["Xianfu Cheng","Weixiao Zhou","Xiang Li","Xiaoming Chen","Jian Yang","Tongliang Li","Zhoujun Li"],"pdf_url":"https://arxiv.org/pdf/2401.10110v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2205.00159 by other authors"},{"id":"http://arxiv.org/abs/2312.06946v2","updated":"2024-01-19T02:08:07Z","published":"2023-12-12T02:55:14Z","title":"WaterHE-NeRF: Water-ray Tracing Neural Radiance Fields for Underwater\n Scene Reconstruction","summary":" Neural Radiance Field (NeRF) technology demonstrates immense potential in\nnovel viewpoint synthesis tasks, due to its physics-based volumetric rendering\nprocess, which is particularly promising in underwater scenes. Addressing the\nlimitations of existing underwater NeRF methods in handling light attenuation\ncaused by the water medium and the lack of real Ground Truth (GT) supervision,\nthis study proposes WaterHE-NeRF. We develop a new water-ray tracing field by\nRetinex theory that precisely encodes color, density, and illuminance\nattenuation in three-dimensional space. WaterHE-NeRF, through its illuminance\nattenuation mechanism, generates both degraded and clear multi-view images and\noptimizes image restoration by combining reconstruction loss with Wasserstein\ndistance. Additionally, the use of histogram equalization (HE) as pseudo-GT\nenhances the network's accuracy in preserving original details and color\ndistribution. Extensive experiments on real underwater datasets and synthetic\ndatasets validate the effectiveness of WaterHE-NeRF. Our code will be made\npublicly available.\n","authors":["Jingchun Zhou","Tianyu Liang","Dehuan Zhang","Zongxin He"],"pdf_url":"https://arxiv.org/pdf/2312.06946v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18999v3","updated":"2024-01-19T01:57:15Z","published":"2023-10-29T12:55:53Z","title":"DynPoint: Dynamic Neural Point For View Synthesis","summary":" The introduction of neural radiance fields has greatly improved the\neffectiveness of view synthesis for monocular videos. However, existing\nalgorithms face difficulties when dealing with uncontrolled or lengthy\nscenarios, and require extensive training time specific to each new scenario.\nTo tackle these limitations, we propose DynPoint, an algorithm designed to\nfacilitate the rapid synthesis of novel views for unconstrained monocular\nvideos. Rather than encoding the entirety of the scenario information into a\nlatent representation, DynPoint concentrates on predicting the explicit 3D\ncorrespondence between neighboring frames to realize information aggregation.\nSpecifically, this correspondence prediction is achieved through the estimation\nof consistent depth and scene flow information across frames. Subsequently, the\nacquired correspondence is utilized to aggregate information from multiple\nreference frames to a target frame, by constructing hierarchical neural point\nclouds. The resulting framework enables swift and accurate view synthesis for\ndesired views of target frames. The experimental results obtained demonstrate\nthe considerable acceleration of training time achieved - typically an order of\nmagnitude - by our proposed method while yielding comparable outcomes compared\nto prior approaches. 
Furthermore, our method exhibits strong robustness in\nhandling long-duration videos without learning a canonical representation of\nvideo content.\n","authors":["Kaichen Zhou","Jia-Xing Zhong","Sangyun Shin","Kai Lu","Yiyuan Yang","Andrew Markham","Niki Trigoni"],"pdf_url":"https://arxiv.org/pdf/2310.18999v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.10766v2","updated":"2024-01-19T01:51:45Z","published":"2023-01-25T18:59:15Z","title":"On the Adversarial Robustness of Camera-based 3D Object Detection","summary":" In recent years, camera-based 3D object detection has gained widespread\nattention for its ability to achieve high performance with low computational\ncost. However, the robustness of these methods to adversarial attacks has not\nbeen thoroughly examined, especially when considering their deployment in\nsafety-critical domains like autonomous driving. In this study, we conduct the\nfirst comprehensive investigation of the robustness of leading camera-based 3D\nobject detection approaches under various adversarial conditions. We\nsystematically analyze the resilience of these models under two attack\nsettings: white-box and black-box; focusing on two primary objectives:\nclassification and localization. Additionally, we delve into two types of\nadversarial attack techniques: pixel-based and patch-based. Our experiments\nyield four interesting findings: (a) bird's-eye-view-based representations\nexhibit stronger robustness against localization attacks; (b)\ndepth-estimation-free approaches have the potential to show stronger\nrobustness; (c) accurate depth estimation effectively improves robustness for\ndepth-estimation-based methods; (d) incorporating multi-frame benign inputs can\neffectively mitigate adversarial attacks. We hope our findings can steer the\ndevelopment of future camera-based object detection models with enhanced\nadversarial robustness.\n","authors":["Shaoyuan Xie","Zichao Li","Zeyu Wang","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2301.10766v2.pdf","comment":"Transactions on Machine Learning Research, 2024. ISSN 2835-8856"},{"id":"http://arxiv.org/abs/2312.06955v2","updated":"2024-01-19T01:47:22Z","published":"2023-12-12T03:26:04Z","title":"IA2U: A Transfer Plugin with Multi-Prior for In-Air Model to Underwater","summary":" In underwater environments, variations in suspended particle concentration\nand turbidity cause severe image degradation, posing significant challenges to\nimage enhancement (IE) and object detection (OD) tasks. Currently, in-air image\nenhancement and detection methods have made notable progress, but their\napplication in underwater conditions is limited due to the complexity and\nvariability of these environments. Fine-tuning in-air models saves high\noverhead and has more optional reference work than building an underwater model\nfrom scratch. To address these issues, we design a transfer plugin with\nmultiple priors for converting in-air models to underwater applications, named\nIA2U. IA2U enables efficient application in underwater scenarios, thereby\nimproving performance in Underwater IE and OD. IA2U integrates three types of\nunderwater priors: the water type prior that characterizes the degree of image\ndegradation, such as color and visibility; the degradation prior, focusing on\ndifferences in details and textures; and the sample prior, considering the\nenvironmental conditions at the time of capture and the characteristics of the\nphotographed object. 
Utilizing a Transformer-like structure, IA2U employs these\npriors as query conditions and a joint task loss function to achieve\nhierarchical enhancement of task-level underwater image features, therefore\nconsidering the requirements of two different tasks, IE and OD. Experimental\nresults show that IA2U combined with an in-air model can achieve superior\nperformance in underwater image enhancement and object detection tasks. The\ncode will be made publicly available.\n","authors":["Jingchun Zhou","Qilin Gai","Kin-man Lam","Xianping Fu"],"pdf_url":"https://arxiv.org/pdf/2312.06955v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.06999v2","updated":"2024-01-19T01:46:49Z","published":"2023-12-12T06:07:21Z","title":"DGNet: Dynamic Gradient-guided Network with Noise Suppression for\n Underwater Image Enhancement","summary":" Underwater image enhancement (UIE) is a challenging task due to the complex\ndegradation caused by underwater environments. To solve this issue, previous\nmethods often idealize the degradation process, and neglect the impact of\nmedium noise and object motion on the distribution of image features, limiting\nthe generalization and adaptability of the model. Previous methods use the\nreference gradient that is constructed from original images and synthetic\nground-truth images. This may cause the network performance to be influenced by\nsome low-quality training data. Our approach utilizes predicted images to\ndynamically update pseudo-labels, adding a dynamic gradient to optimize the\nnetwork's gradient space. This process improves image quality and avoids local\noptima. Moreover, we propose a Feature Restoration and Reconstruction module\n(FRR) based on a Channel Combination Inference (CCI) strategy and a Frequency\nDomain Smoothing module (FRS). These modules decouple other degradation\nfeatures while reducing the impact of various types of noise on network\nperformance. Experiments on multiple public datasets demonstrate the\nsuperiority of our method over existing state-of-the-art approaches, especially\nin achieving performance milestones: PSNR of 25.6dB and SSIM of 0.93 on the\nUIEB dataset. Its efficiency in terms of parameter size and inference time\nfurther attests to its broad practicality. The code will be made publicly\navailable.\n","authors":["Jingchun Zhou","Zongxin He","Dehuan Zhang","Kin-man Lam","Xianping Fu","Yi Wang"],"pdf_url":"https://arxiv.org/pdf/2312.06999v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10442v1","updated":"2024-01-19T01:11:44Z","published":"2024-01-19T01:11:44Z","title":"Path Choice Matters for Clear Attribution in Path Methods","summary":" Rigorousness and clarity are both essential for interpretations of DNNs to\nengender human trust. Path methods are commonly employed to generate rigorous\nattributions that satisfy three axioms. However, the meaning of attributions\nremains ambiguous due to distinct path choices. To address the ambiguity, we\nintroduce \\textbf{Concentration Principle}, which centrally allocates high\nattributions to indispensable features, thereby endowing aesthetic and\nsparsity. We then present \\textbf{SAMP}, a model-agnostic interpreter, which\nefficiently searches the near-optimal path from a pre-defined set of\nmanipulation paths. Moreover, we propose the infinitesimal constraint (IC) and\nmomentum strategy (MS) to improve the rigorousness and optimality.\nVisualizations show that SAMP can precisely reveal DNNs by pinpointing salient\nimage pixels. 
We also perform quantitative experiments and observe that our\nmethod significantly outperforms the counterparts. Code:\nhttps://github.com/zbr17/SAMP.\n","authors":["Borui Zhang","Wenzhao Zheng","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2401.10442v1.pdf","comment":"ICLR 2024 accepted"},{"id":"http://arxiv.org/abs/2304.00746v3","updated":"2024-01-19T00:42:13Z","published":"2023-04-03T06:40:52Z","title":"OTS: A One-shot Learning Approach for Text Spotting in Historical\n Manuscripts","summary":" In the field of historical manuscript research, scholars frequently encounter\nnovel symbols in ancient texts, investing considerable effort in their\nidentification and documentation. Although some object detection methods have\nachieved impressive performance, they primarily excel at detecting categories\nincluded in training datasets, often failing to recognize novel symbols without\nretraining. To overcome this limitation, we propose a novel One-shot\nlearning-based Text Spotting (OTS) approach that accurately and reliably spots\nnovel characters with just one annotated support sample. Drawing inspiration\nfrom cognitive research, we introduce a spatial alignment module that finds,\nfocuses on, and learns the most discriminative spatial regions in the query\nimage based on one support image. Especially, since the low-resource spotting\ntask often faces the problem of example imbalance, we propose a novel loss\nfunction called torus loss which can make the embedding space of distance\nmetric more discriminative. Our approach is highly efficient and requires only\na few training samples while exhibiting the remarkable ability to handle novel\ncharacters and symbols. To enhance dataset diversity, a new manuscript dataset\nthat contains the ancient Dongba hieroglyphics (DBH) is created, a script\nassociated with China and developed by the ancestors of the Naxi minority. We\nconduct experiments on publicly available DBH, EGY, VML-HD, TKH, and NC\ndatasets. The experimental results demonstrate that OTS outperforms the\nstate-of-the-art methods in one-shot text spotting. Overall, our proposed\nmethod offers promising applications in text spotting in historical\nmanuscripts.\n","authors":["Wenbo Hu","Hongjian Zhan","Cong Liu","Bing Yin","Yue Lu"],"pdf_url":"https://arxiv.org/pdf/2304.00746v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.00110v3","updated":"2024-01-19T00:35:35Z","published":"2023-12-30T01:24:25Z","title":"Diffusion Model with Perceptual Loss","summary":" Diffusion models trained with mean squared error loss tend to generate\nunrealistic samples. Current state-of-the-art models rely on classifier-free\nguidance to improve sample quality, yet its surprising effectiveness is not\nfully understood. In this paper, we show that the effectiveness of\nclassifier-free guidance partly originates from it being a form of implicit\nperceptual guidance. As a result, we can directly incorporate perceptual loss\nin diffusion training to improve sample quality. Since the score matching\nobjective used in diffusion training strongly resembles the denoising\nautoencoder objective used in unsupervised training of perceptual networks, the\ndiffusion model itself is a perceptual network and can be used to generate\nmeaningful perceptual loss. We propose a novel self-perceptual objective that\nresults in diffusion models capable of generating more realistic samples. 
For\nconditional generation, our method only improves sample quality without\nentanglement with the conditional input and therefore does not sacrifice sample\ndiversity. Our method can also improve sample quality for unconditional\ngeneration, which was not possible with classifier-free guidance before.\n","authors":["Shanchuan Lin","Xiao Yang"],"pdf_url":"https://arxiv.org/pdf/2401.00110v3.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2401.10841v1","updated":"2024-01-19T17:40:50Z","published":"2024-01-19T17:40:50Z","title":"Using LLMs to discover emerging coded antisemitic hate-speech emergence\n in extremist social media","summary":" Online hate speech proliferation has created a difficult problem for social\nmedia platforms. A particular challenge relates to the use of coded language by\ngroups interested in both creating a sense of belonging for its users and\nevading detection. Coded language evolves quickly and its use varies over time.\nThis paper proposes a methodology for detecting emerging coded hate-laden\nterminology. The methodology is tested in the context of online antisemitic\ndiscourse. The approach considers posts scraped from social media platforms,\noften used by extremist users. The posts are scraped using seed expressions\nrelated to previously known discourse of hatred towards Jews. The method begins\nby identifying the expressions most representative of each post and calculating\ntheir frequency in the whole corpus. It filters out grammatically incoherent\nexpressions as well as previously encountered ones so as to focus on emergent\nwell-formed terminology. This is followed by an assessment of semantic\nsimilarity to known antisemitic terminology using a fine-tuned large language\nmodel, and subsequent filtering out of the expressions that are too distant\nfrom known expressions of hatred. Emergent antisemitic expressions containing\nterms clearly relating to Jewish topics are then removed to return only coded\nexpressions of hatred.\n","authors":["Dhanush Kikkisetti","Raza Ul Mustafa","Wendy Melillo","Roberto Corizzo","Zois Boukouvalas","Jeff Gill","Nathalie Japkowicz"],"pdf_url":"https://arxiv.org/pdf/2401.10841v1.pdf","comment":"9 pages, 4 figures, 2 algorithms, 3 tables"},{"id":"http://arxiv.org/abs/2312.09631v2","updated":"2024-01-19T17:07:40Z","published":"2023-12-15T09:21:11Z","title":"Context-Driven Interactive Query Simulations Based on Generative Large\n Language Models","summary":" Simulating user interactions enables a more user-oriented evaluation of\ninformation retrieval (IR) systems. While user simulations are cost-efficient\nand reproducible, many approaches often lack fidelity regarding real user\nbehavior. Most notably, current user models neglect the user's context, which\nis the primary driver of perceived relevance and the interactions with the\nsearch results. To this end, this work introduces the simulation of\ncontext-driven query reformulations. The proposed query generation methods\nbuild upon recent Large Language Model (LLM) approaches and consider the user's\ncontext throughout the simulation of a search session. Compared to simple\ncontext-free query generation approaches, these methods show better\neffectiveness and allow the simulation of more efficient IR sessions.\nSimilarly, our evaluations consider more interaction context than current\nsession-based measures and reveal interesting complementary insights in\naddition to the established evaluation protocols. 
We conclude with directions\nfor future work and provide an entirely open experimental setup.\n","authors":["Björn Engelmann","Timo Breuer","Jana Isabelle Friese","Philipp Schaer","Norbert Fuhr"],"pdf_url":"https://arxiv.org/pdf/2312.09631v2.pdf","comment":"Accepted at ECIR 2024 (Full Paper)"},{"id":"http://arxiv.org/abs/2308.07107v3","updated":"2024-01-19T16:01:28Z","published":"2023-08-14T12:47:22Z","title":"Large Language Models for Information Retrieval: A Survey","summary":" As a primary means of information acquisition, information retrieval (IR)\nsystems, such as search engines, have integrated themselves into our daily\nlives. These systems also serve as components of dialogue, question-answering,\nand recommender systems. The trajectory of IR has evolved dynamically from its\norigins in term-based methods to its integration with advanced neural models.\nWhile the neural models excel at capturing complex contextual signals and\nsemantic nuances, thereby reshaping the IR landscape, they still face\nchallenges such as data scarcity, interpretability, and the generation of\ncontextually plausible yet potentially inaccurate responses. This evolution\nrequires a combination of both traditional methods (such as term-based sparse\nretrieval methods with rapid response) and modern neural architectures (such as\nlanguage models with powerful language understanding capacity). Meanwhile, the\nemergence of large language models (LLMs), typified by ChatGPT and GPT-4, has\nrevolutionized natural language processing due to their remarkable language\nunderstanding, generation, generalization, and reasoning abilities.\nConsequently, recent research has sought to leverage LLMs to improve IR\nsystems. Given the rapid evolution of this research trajectory, it is necessary\nto consolidate existing methodologies and provide nuanced insights through a\ncomprehensive overview. In this survey, we delve into the confluence of LLMs\nand IR systems, including crucial aspects such as query rewriters, retrievers,\nrerankers, and readers. Additionally, we explore promising directions, such as\nsearch agents, within this expanding field.\n","authors":["Yutao Zhu","Huaying Yuan","Shuting Wang","Jiongnan Liu","Wenhan Liu","Chenlong Deng","Haonan Chen","Zhicheng Dou","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2308.07107v3.pdf","comment":"updated to version 2"},{"id":"http://arxiv.org/abs/2401.10733v1","updated":"2024-01-19T14:50:22Z","published":"2024-01-19T14:50:22Z","title":"Dynamic Q&A of Clinical Documents with Large Language Models","summary":" Electronic health records (EHRs) house crucial patient data in clinical\nnotes. As these notes grow in volume and complexity, manual extraction becomes\nchallenging. This work introduces a natural language interface using large\nlanguage models (LLMs) for dynamic question-answering on clinical notes. Our\nchatbot, powered by Langchain and transformer-based LLMs, allows users to query\nin natural language, receiving relevant answers from clinical notes.\nExperiments, utilizing various embedding models and advanced LLMs, show Wizard\nVicuna's superior accuracy, albeit with high compute demands. Model\noptimization, including weight quantization, improves latency by approximately\n48 times. Promising results indicate potential, yet challenges such as model\nhallucinations and limited diverse medical case evaluations remain. 
Addressing\nthese gaps is crucial for unlocking the value in clinical notes and advancing\nAI-driven clinical decision-making.\n","authors":["Ran Elgedawy","Sudarshan Srinivasan","Ioana Danciu"],"pdf_url":"https://arxiv.org/pdf/2401.10733v1.pdf","comment":"8 pages, 4 figures"},{"id":"http://arxiv.org/abs/2401.10690v1","updated":"2024-01-19T13:41:08Z","published":"2024-01-19T13:41:08Z","title":"Beyond RMSE and MAE: Introducing EAUC to unmask hidden bias and\n unfairness in dyadic regression models","summary":" Dyadic regression models, which predict real-valued outcomes for pairs of\nentities, are fundamental in many domains (e.g. predicting the rating of a user\nto a product in Recommender Systems) and promising and under exploration in\nmany others (e.g. approximating the adequate dosage of a drug for a patient in\npersonalized pharmacology). In this work, we demonstrate that non-uniformity in\nthe observed value distributions of individual entities leads to severely\nbiased predictions in state-of-the-art models, skewing predictions towards the\naverage of observed past values for the entity and providing worse-than-random\npredictive power in eccentric yet equally important cases. We show that the\nusage of global error metrics like Root Mean Squared Error (RMSE) and Mean\nAbsolute Error (MAE) is insufficient to capture this phenomenon, which we name\neccentricity bias, and we introduce Eccentricity-Area Under the Curve (EAUC) as\na new complementary metric that can quantify it in all studied models and\ndatasets. We also prove the adequateness of EAUC by using naive de-biasing\ncorrections to demonstrate that a lower model bias correlates with a lower EAUC\nand vice-versa. This work contributes a bias-aware evaluation of dyadic\nregression models to avoid potential unfairness and risks in critical\nreal-world applications of such systems.\n","authors":["Jorge Paz-Ruza","Amparo Alonso-Betanzos","Bertha Guijarro-Berdiñas","Brais Cancela","Carlos Eiras-Franco"],"pdf_url":"https://arxiv.org/pdf/2401.10690v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10634v1","updated":"2024-01-19T11:22:04Z","published":"2024-01-19T11:22:04Z","title":"Automatic Construction of Multi-faceted User Profiles using Text\n Clustering and its Application to Expert Recommendation and Filtering\n Problems","summary":" In the information age we are living in today, not only are we interested in\naccessing multimedia objects such as documents, videos, etc. but also in\nsearching for professional experts, people or celebrities, possibly for\nprofessional needs or just for fun. Information access systems need to be able\nto extract and exploit various sources of information (usually in text format)\nabout such individuals, and to represent them in a suitable way usually in the\nform of a profile. In this article, we tackle the problems of profile-based\nexpert recommendation and document filtering from a machine learning\nperspective by clustering expert textual sources to build profiles and capture\nthe different hidden topics in which the experts are interested. The experts\nwill then be represented by means of multi-faceted profiles. Our experiments\nshow that this is a valid technique to improve the performance of expert\nfinding and document filtering.\n","authors":["Luis M. de Campos","Juan M. Fernández-Luna","Juan F. 
Huete","Luis Redondo-Expósito"],"pdf_url":"https://arxiv.org/pdf/2401.10634v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10617v1","updated":"2024-01-19T10:49:31Z","published":"2024-01-19T10:49:31Z","title":"LDA-based Term Profiles for Expert Finding in a Political Setting","summary":" A common task in many political institutions (i.e. Parliament) is to find\npoliticians who are experts in a particular field. In order to tackle this\nproblem, the first step is to obtain politician profiles which include their\ninterests, and these can be automatically learned from their speeches. As a\npolitician may have various areas of expertise, one alternative is to use a set\nof subprofiles, each of which covers a different subject. In this study, we\npropose a novel approach for this task by using latent Dirichlet allocation\n(LDA) to determine the main underlying topics of each political speech, and to\ndistribute the related terms among the different topic-based subprofiles. With\nthis objective, we propose the use of fifteen distance and similarity measures\nto automatically determine the optimal number of topics discussed in a\ndocument, and to demonstrate that every measure converges into five strategies:\nEuclidean, Dice, Sorensen, Cosine and Overlap. Our experimental results showed\nthat the scores of the different accuracy metrics of the proposed strategies\ntended to be higher than those of the baselines for expert recommendation\ntasks, and that the use of an appropriate number of topics has proved relevant.\n","authors":["Luis M. de Campos","Juan M. Fernández-Luna","Juan F. Huete","Luis Redondo-Expósito"],"pdf_url":"https://arxiv.org/pdf/2401.10617v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10611v1","updated":"2024-01-19T10:42:29Z","published":"2024-01-19T10:42:29Z","title":"Publication venue recommendation using profiles based on clustering","summary":" In this paper we study the venue recommendation problem in order to help\nresearchers to identify a journal or conference to submit a given paper. A\ncommon approach to tackle this problem is to build profiles defining the scope\nof each venue. Then, these profiles are compared against the target paper. In\nour approach we will study how clustering techniques can be used to construct\ntopic-based profiles and use an Information Retrieval based approach to obtain\nthe final recommendations. Additionally, we will explore how the use of\nauthorship, representing a complementary piece of information, helps to improve\nthe recommendations.\n","authors":["Luis M. de Campos","Juan M. Fernández-Luna","Juan F. Huete"],"pdf_url":"https://arxiv.org/pdf/2401.10611v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10607v1","updated":"2024-01-19T10:32:28Z","published":"2024-01-19T10:32:28Z","title":"Use of topical and temporal profiles and their hybridisation for\n content-based recommendation","summary":" In the context of content-based recommender systems, the aim of this paper is\nto determine how better profiles can be built and how these affect the\nrecommendation process based on the incorporation of temporality, i.e. the\ninclusion of time in the recommendation process, and topicality, i.e. the\nrepresentation of texts associated with users and items using topics and their\ncombination. The main contribution of the paper is to present two different\nways of hybridising these two dimensions and to evaluate and compare them with\nother alternatives.\n","authors":["Luis M. de Campos","Juan M. Fernández-Luna","Juan F. 
Huete"],"pdf_url":"https://arxiv.org/pdf/2401.10607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10545v1","updated":"2024-01-19T08:09:20Z","published":"2024-01-19T08:09:20Z","title":"Understanding Biases in ChatGPT-based Recommender Systems: Provider\n Fairness, Temporal Stability, and Recency","summary":" This study explores the nuanced capabilities and inherent biases of\nRecommender Systems using Large Language Models (RecLLMs), with a focus on\nChatGPT-based systems. It studies into the contrasting behaviors of generative\nmodels and traditional collaborative filtering models in movie recommendations.\nThe research primarily investigates prompt design strategies and their impact\non various aspects of recommendation quality, including accuracy, provider\nfairness, diversity, stability, genre dominance, and temporal freshness\n(recency).\n Our experimental analysis reveals that the introduction of specific 'system\nroles' and 'prompt strategies' in RecLLMs significantly influences their\nperformance. For instance, role-based prompts enhance fairness and diversity in\nrecommendations, mitigating popularity bias. We find that while GPT-based\nmodels do not always match the performance of CF baselines, they exhibit a\nunique tendency to recommend newer and more diverse movie genres. Notably,\nGPT-based models tend to recommend more recent films, particularly those\nreleased post-2000, and show a preference for genres like \\sq{Drama} and\nComedy, and Romance (compared to CF Action, Adventure) presumably due to the\nRecLLMs' training on varied data sets, which allows them to capture recent\ntrends and discussions more effectively than CF models. Interestingly, our\nresults demonstrate that the 'Simple' and 'Chain of Thought (COT)' paradigms\nyield the highest accuracy. These findings imply the potential of combining\nthese strategies with scenarios that favor more recent content, thereby\noffering a more balanced and up-to-date recommendation experience. This study\ncontributes significantly to the understanding of emerging RecLLMs,\nparticularly in the context of harms and biases within these systems.\n","authors":["Yashar Deldjoo"],"pdf_url":"https://arxiv.org/pdf/2401.10545v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.04971v2","updated":"2024-01-19T07:52:57Z","published":"2024-01-10T07:31:26Z","title":"A Survey on Cross-Domain Sequential Recommendation","summary":" Cross-domain sequential recommendation (CDSR) shifts the modeling of user\npreferences from flat to stereoscopic by integrating and learning interaction\ninformation from multiple domains at different granularities (ranging from\ninter-sequence to intra-sequence and from single-domain to cross-domain). In\nthis survey, we first define the CDSR problem using a four-dimensional tensor\nand then analyze its multi-type input representations under multidirectional\ndimensionality reductions. Following that, we provide a systematic overview\nfrom both macro and micro views. From a macro view, we abstract the multi-level\nfusion structures of various models across domains and discuss their bridges\nfor fusion. From a micro view, focusing on the existing models, we specifically\ndiscuss the basic technologies and then explain the auxiliary learning\ntechnologies. 
Finally, we exhibit the available public datasets and the\nrepresentative experimental results as well as provide some insights into\nfuture directions for research in CDSR.\n","authors":["Shu Chen","Zitao Xu","Weike Pan","Qiang Yang","Zhong Ming"],"pdf_url":"https://arxiv.org/pdf/2401.04971v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.09885v2","updated":"2024-01-19T07:23:04Z","published":"2024-01-18T10:56:27Z","title":"Source Code Clone Detection Using Unsupervised Similarity Measures","summary":" Assessing similarity in source code has gained significant attention in\nrecent years due to its importance in software engineering tasks such as clone\ndetection and code search and recommendation. This work presents a comparative\nanalysis of unsupervised similarity measures for identifying source code clone\ndetection. The goal is to overview the current state-of-the-art techniques,\ntheir strengths, and weaknesses. To do that, we compile the existing\nunsupervised strategies and evaluate their performance on a benchmark dataset\nto guide software engineers in selecting appropriate methods for their specific\nuse cases. The source code of this study is available at\nhttps://github.com/jorge-martinez-gil/codesim\n","authors":["Jorge Martinez-Gil"],"pdf_url":"https://arxiv.org/pdf/2401.09885v2.pdf","comment":"Accepted for publication as Full Paper in the Software Quality Days\n 2024, Vienna, Austria"},{"id":"http://arxiv.org/abs/2401.00368v2","updated":"2024-01-19T05:16:20Z","published":"2023-12-31T02:13:18Z","title":"Improving Text Embeddings with Large Language Models","summary":" In this paper, we introduce a novel and simple method for obtaining\nhigh-quality text embeddings using only synthetic data and less than 1k\ntraining steps. Unlike existing methods that often depend on multi-stage\nintermediate pre-training with billions of weakly-supervised text pairs,\nfollowed by fine-tuning with a few labeled datasets, our method does not\nrequire building complex training pipelines or relying on manually collected\ndatasets that are often constrained by task diversity and language coverage. We\nleverage proprietary LLMs to generate diverse synthetic data for hundreds of\nthousands of text embedding tasks across nearly 100 languages. We then\nfine-tune open-source decoder-only LLMs on the synthetic data using standard\ncontrastive loss. Experiments demonstrate that our method achieves strong\nperformance on highly competitive text embedding benchmarks without using any\nlabeled data. Furthermore, when fine-tuned with a mixture of synthetic and\nlabeled data, our model sets new state-of-the-art results on the BEIR and MTEB\nbenchmarks.\n","authors":["Liang Wang","Nan Yang","Xiaolong Huang","Linjun Yang","Rangan Majumder","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2401.00368v2.pdf","comment":"20 pages, 15 tables"},{"id":"http://arxiv.org/abs/2401.10487v1","updated":"2024-01-19T04:24:07Z","published":"2024-01-19T04:24:07Z","title":"Generative Dense Retrieval: Memory Can Be a Burden","summary":" Generative Retrieval (GR), autoregressively decoding relevant document\nidentifiers given a query, has been shown to perform well under the setting of\nsmall-scale corpora. By memorizing the document corpus with model parameters,\nGR implicitly achieves deep interaction between query and document. 
However,\nsuch a memorizing mechanism faces three drawbacks: (1) Poor memory accuracy for\nfine-grained features of documents; (2) Memory confusion gets worse as the\ncorpus size increases; (3) Huge memory update costs for new documents. To\nalleviate these problems, we propose the Generative Dense Retrieval (GDR)\nparadigm. Specifically, GDR first uses the limited memory volume to achieve\ninter-cluster matching from query to relevant document clusters.\nMemorizing-free matching mechanism from Dense Retrieval (DR) is then introduced\nto conduct fine-grained intra-cluster matching from clusters to relevant\ndocuments. The coarse-to-fine process maximizes the advantages of GR's deep\ninteraction and DR's scalability. Besides, we design a cluster identifier\nconstructing strategy to facilitate corpus memory and a cluster-adaptive\nnegative sampling strategy to enhance the intra-cluster mapping ability.\nEmpirical results show that GDR obtains an average of 3.0 R@100 improvement on\nNQ dataset under multiple settings and has better scalability.\n","authors":["Peiwen Yuan","Xinglin Wang","Shaoxiong Feng","Boyuan Pan","Yiwei Li","Heda Wang","Xupeng Miao","Kan Li"],"pdf_url":"https://arxiv.org/pdf/2401.10487v1.pdf","comment":"EACL 2024 main"},{"id":"http://arxiv.org/abs/2401.10484v1","updated":"2024-01-19T04:17:50Z","published":"2024-01-19T04:17:50Z","title":"Enhancing Scalability in Recommender Systems through Lottery Ticket\n Hypothesis and Knowledge Distillation-based Neural Network Pruning","summary":" This study introduces an innovative approach aimed at the efficient pruning\nof neural networks, with a particular focus on their deployment on edge\ndevices. Our method involves the integration of the Lottery Ticket Hypothesis\n(LTH) with the Knowledge Distillation (KD) framework, resulting in the\nformulation of three distinct pruning models. These models have been developed\nto address scalability issue in recommender systems, whereby the complexities\nof deep learning models have hindered their practical deployment. With\njudicious application of the pruning techniques, we effectively curtail the\npower consumption and model dimensions without compromising on accuracy.\nEmpirical evaluation has been performed using two real world datasets from\ndiverse domains against two baselines. Gratifyingly, our approaches yielded a\nGPU computation-power reduction of up to 66.67%. Notably, our study contributes\nto the field of recommendation system by pioneering the application of LTH and\nKD.\n","authors":["Rajaram R","Manoj Bharadhwaj","Vasan VS","Nargis Pervin"],"pdf_url":"https://arxiv.org/pdf/2401.10484v1.pdf","comment":"Accepted in WITS 2023 as a workshop paper"},{"id":"http://arxiv.org/abs/2401.10963v1","updated":"2024-01-19T11:50:26Z","published":"2024-01-19T11:50:26Z","title":"On the selection of the correct number of terms for profile\n construction: theoretical and empirical analysis","summary":" In this paper, we examine the problem of building a user profile from a set\nof documents. This profile will consist of a subset of the most representative\nterms in the documents that best represent user preferences or interests.\nInspired by the discrete concentration theory we have conducted an axiomatic\nstudy of seven properties that a selection function should fulfill: the minimum\nand maximum uncertainty principle, invariant to adding zeros, invariant to\nscale transformations, principle of nominal increase, transfer principle and\nthe richest get richer inequality. 
We also present a novel selection function\nbased on the use of similarity metrics, and more specifically the cosine\nmeasure which is commonly used in information retrieval, and demonstrate that\nthis verifies six of the properties in addition to a weaker variant of the\ntransfer principle, thereby representing a good selection approach. The\ntheoretical study was complemented with an empirical study to compare the\nperformance of different selection criteria (weight- and unweight-based) using\nreal data in a parliamentary setting. In this study, we analyze the performance\nof the different functions focusing on the two main factors affecting the\nselection process: profile size (number of terms) and weight distribution.\nThese profiles are then used in a document filtering task to show that our\nsimilarity-based approach performs well in terms not only of recommendation\naccuracy but also efficiency (we obtain smaller profiles and consequently\nfaster recommendations).\n","authors":["Luis M. de Campos","Juan M. Fernández-Luna","Juan F. Huete"],"pdf_url":"https://arxiv.org/pdf/2401.10963v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10961v1","updated":"2024-01-19T11:14:37Z","published":"2024-01-19T11:14:37Z","title":"Positive unlabeled learning for building recommender systems in a\n parliamentary setting","summary":" Our goal is to learn about the political interests and preferences of the\nMembers of Parliament by mining their parliamentary activity, in order to\ndevelop a recommendation/filtering system that, given a stream of documents to\nbe distributed among them, is able to decide which documents should receive\neach Member of Parliament. We propose to use positive unlabeled learning to\ntackle this problem, because we only have information about relevant documents\n(the own interventions of each Member of Parliament in the debates) but not\nabout irrelevant documents, so that we cannot use standard binary classifiers\ntrained with positive and negative examples. We have also developed a new\nalgorithm of this type, which compares favourably with: a) the baseline\napproach assuming that all the interventions of other Members of Parliament are\nirrelevant, b) another well-known positive unlabeled learning method and c) an\napproach based on information retrieval methods that matches documents and\nlegislators' representations. The experiments have been carried out with data\nfrom the regional Andalusian Parliament at Spain.\n","authors":["Luis M. de Camposa","Juan M. Fernández-Luna","Juan F. Huete","Luis Redondo-Expósito"],"pdf_url":"https://arxiv.org/pdf/2401.10961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10956v1","updated":"2024-01-19T05:54:35Z","published":"2024-01-19T05:54:35Z","title":"AI Revolution on Chat Bot: Evidence from a Randomized Controlled\n Experiment","summary":" In recent years, generative AI has undergone major advancements,\ndemonstrating significant promise in augmenting human productivity. Notably,\nlarge language models (LLM), with ChatGPT-4 as an example, have drawn\nconsiderable attention. Numerous articles have examined the impact of LLM-based\ntools on human productivity in lab settings and designed tasks or in\nobservational studies. Despite recent advances, field experiments applying\nLLM-based tools in realistic settings are limited. 
This paper presents the\nfindings of a field randomized controlled trial assessing the effectiveness of\nLLM-based tools in providing unmonitored support services for information\nretrieval.\n","authors":["Sida Peng","Wojciech Swiatek","Allen Gao","Paul Cullivan","Haoge Chang"],"pdf_url":"https://arxiv.org/pdf/2401.10956v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2401.10886v1","updated":"2024-01-19T18:57:46Z","published":"2024-01-19T18:57:46Z","title":"SCENES: Subpixel Correspondence Estimation With Epipolar Supervision","summary":" Extracting point correspondences from two or more views of a scene is a\nfundamental computer vision problem with particular importance for relative\ncamera pose estimation and structure-from-motion. Existing local feature\nmatching approaches, trained with correspondence supervision on large-scale\ndatasets, obtain highly-accurate matches on the test sets. However, they do not\ngeneralise well to new datasets with different characteristics to those they\nwere trained on, unlike classic feature extractors. Instead, they require\nfinetuning, which assumes that ground-truth correspondences or ground-truth\ncamera poses and 3D structure are available. We relax this assumption by\nremoving the requirement of 3D structure, e.g., depth maps or point clouds, and\nonly require camera pose information, which can be obtained from odometry. We\ndo so by replacing correspondence losses with epipolar losses, which encourage\nputative matches to lie on the associated epipolar line. While weaker than\ncorrespondence supervision, we observe that this cue is sufficient for\nfinetuning existing models on new data. We then further relax the assumption of\nknown camera poses by using pose estimates in a novel bootstrapping approach.\nWe evaluate on highly challenging datasets, including an indoor drone dataset\nand an outdoor smartphone camera dataset, and obtain state-of-the-art results\nwithout strong supervision.\n","authors":["Dominik A. Kloepfer","João F. Henriques","Dylan Campbell"],"pdf_url":"https://arxiv.org/pdf/2401.10886v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10874v1","updated":"2024-01-19T18:33:52Z","published":"2024-01-19T18:33:52Z","title":"Applications of flow models to the generation of correlated lattice QCD\n ensembles","summary":" Machine-learned normalizing flows can be used in the context of lattice\nquantum field theory to generate statistically correlated ensembles of lattice\ngauge fields at different action parameters. This work demonstrates how these\ncorrelations can be exploited for variance reduction in the computation of\nobservables. Three different proof-of-concept applications are demonstrated\nusing a novel residual flow architecture: continuum limits of gauge theories,\nthe mass dependence of QCD observables, and hadronic matrix elements based on\nthe Feynman-Hellmann approach. In all three cases, it is shown that statistical\nuncertainties are significantly reduced when machine-learned flows are\nincorporated as compared with the same calculations performed with uncorrelated\nensembles or direct reweighting.\n","authors":["Ryan Abbott","Aleksandar Botev","Denis Boyda","Daniel C. Hackett","Gurtej Kanwar","Sébastien Racanière","Danilo J. Rezende","Fernando Romero-López","Phiala E. Shanahan","Julian M. 
Urban"],"pdf_url":"https://arxiv.org/pdf/2401.10874v1.pdf","comment":"11 pages, 2 tables, 5 figures"},{"id":"http://arxiv.org/abs/2306.00119v2","updated":"2024-01-19T18:30:27Z","published":"2023-05-31T18:48:16Z","title":"Optimal Sets and Solution Paths of ReLU Networks","summary":" We develop an analytical framework to characterize the set of optimal ReLU\nneural networks by reformulating the non-convex training problem as a convex\nprogram. We show that the global optima of the convex parameterization are\ngiven by a polyhedral set and then extend this characterization to the optimal\nset of the non-convex training objective. Since all stationary points of the\nReLU training problem can be represented as optima of sub-sampled convex\nprograms, our work provides a general expression for all critical points of the\nnon-convex objective. We then leverage our results to provide an optimal\npruning algorithm for computing minimal networks, establish conditions for the\nregularization path of ReLU networks to be continuous, and develop sensitivity\nresults for minimal ReLU networks.\n","authors":["Aaron Mishkin","Mert Pilanci"],"pdf_url":"https://arxiv.org/pdf/2306.00119v2.pdf","comment":"Minor updates and corrections to clarify the role of merge/split\n symmetries in formation of ReLU optimal set and add missing sufficient\n conditions for all minimal models to have the same cardinality"},{"id":"http://arxiv.org/abs/2401.10862v1","updated":"2024-01-19T18:05:34Z","published":"2024-01-19T18:05:34Z","title":"Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs\n Without Fine-Tuning","summary":" Large Language Models (LLMs) are vulnerable to `Jailbreaking' prompts, a type\nof attack that can coax these models into generating harmful and illegal\ncontent. In this paper, we show that pruning up to 20% of LLM parameters\nmarkedly increases their resistance to such attacks without additional training\nand without sacrificing their performance in standard benchmarks. Intriguingly,\nwe discovered that the enhanced safety observed post-pruning correlates to the\ninitial safety training level of the model, hinting that the effect of pruning\ncould be more general and may hold for other LLM behaviors beyond safety.\nAdditionally, we introduce a curated dataset of 225 harmful tasks across five\ncategories, inserted into ten different Jailbreaking prompts, showing that\npruning aids LLMs in concentrating attention on task-relevant tokens in\njailbreaking prompts. Lastly, our experiments reveal that the prominent chat\nmodels, such as LLaMA-2 Chat, Vicuna, and Mistral Instruct exhibit high\nsusceptibility to jailbreaking attacks, with some categories achieving nearly\n70-100% success rate. These insights underline the potential of pruning as a\ngeneralizable approach for improving LLM safety, reliability, and potentially\nother desired behaviors.\n","authors":["Adib Hasan","Ileana Rugina","Alex Wang"],"pdf_url":"https://arxiv.org/pdf/2401.10862v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10859v1","updated":"2024-01-19T18:03:21Z","published":"2024-01-19T18:03:21Z","title":"Ensembler: Combating model inversion attacks using model ensemble during\n collaborative inference","summary":" Deep learning models have exhibited remarkable performance across various\ndomains. Nevertheless, the burgeoning model sizes compel edge devices to\noffload a significant portion of the inference process to the cloud. 
While this\npractice offers numerous advantages, it also raises critical concerns regarding\nuser data privacy. In scenarios where the cloud server's trustworthiness is in\nquestion, the need for a practical and adaptable method to safeguard data\nprivacy becomes imperative. In this paper, we introduce Ensembler, an\nextensible framework designed to substantially increase the difficulty of\nconducting model inversion attacks for adversarial parties. Ensembler leverages\nmodel ensembling on the adversarial server, running in parallel with existing\napproaches that introduce perturbations to sensitive data during colloborative\ninference. Our experiments demonstrate that when combined with even basic\nGaussian noise, Ensembler can effectively shield images from reconstruction\nattacks, achieving recognition levels that fall below human performance in some\nstrict settings, significantly outperforming baseline methods lacking the\nEnsembler framework.\n","authors":["Dancheng Liu","Jinjun Xiong"],"pdf_url":"https://arxiv.org/pdf/2401.10859v1.pdf","comment":"in submission"},{"id":"http://arxiv.org/abs/2401.10841v1","updated":"2024-01-19T17:40:50Z","published":"2024-01-19T17:40:50Z","title":"Using LLMs to discover emerging coded antisemitic hate-speech emergence\n in extremist social media","summary":" Online hate speech proliferation has created a difficult problem for social\nmedia platforms. A particular challenge relates to the use of coded language by\ngroups interested in both creating a sense of belonging for its users and\nevading detection. Coded language evolves quickly and its use varies over time.\nThis paper proposes a methodology for detecting emerging coded hate-laden\nterminology. The methodology is tested in the context of online antisemitic\ndiscourse. The approach considers posts scraped from social media platforms,\noften used by extremist users. The posts are scraped using seed expressions\nrelated to previously known discourse of hatred towards Jews. The method begins\nby identifying the expressions most representative of each post and calculating\ntheir frequency in the whole corpus. It filters out grammatically incoherent\nexpressions as well as previously encountered ones so as to focus on emergent\nwell-formed terminology. This is followed by an assessment of semantic\nsimilarity to known antisemitic terminology using a fine-tuned large language\nmodel, and subsequent filtering out of the expressions that are too distant\nfrom known expressions of hatred. Emergent antisemitic expressions containing\nterms clearly relating to Jewish topics are then removed to return only coded\nexpressions of hatred.\n","authors":["Dhanush Kikkisetti","Raza Ul Mustafa","Wendy Melillo","Roberto Corizzo","Zois Boukouvalas","Jeff Gill","Nathalie Japkowicz"],"pdf_url":"https://arxiv.org/pdf/2401.10841v1.pdf","comment":"9 pages, 4 figures, 2 algorithms, 3 tables"},{"id":"http://arxiv.org/abs/2309.14393v2","updated":"2024-01-19T17:33:44Z","published":"2023-09-25T14:50:04Z","title":"LLMCarbon: Modeling the end-to-end Carbon Footprint of Large Language\n Models","summary":" The carbon footprint associated with large language models (LLMs) is a\nsignificant concern, encompassing emissions from their training, inference,\nexperimentation, and storage processes, including operational and embodied\ncarbon emissions. An essential aspect is accurately estimating the carbon\nimpact of emerging LLMs even before their training, which heavily relies on GPU\nusage. 
Existing studies have reported the carbon footprint of LLM training, but\nonly one tool, mlco2, can predict the carbon footprint of new neural networks\nprior to physical training. However, mlco2 has several serious limitations. It\ncannot extend its estimation to dense or mixture-of-experts (MoE) LLMs,\ndisregards critical architectural parameters, focuses solely on GPUs, and\ncannot model embodied carbon footprints. Addressing these gaps, we introduce\n\\textit{\\carb}, an end-to-end carbon footprint projection model designed for\nboth dense and MoE LLMs. Compared to mlco2, \\carb~significantly enhances the\naccuracy of carbon footprint estimations for various LLMs. The source code is\nreleased at \\url{https://github.com/SotaroKaneda/MLCarbon}.\n","authors":["Ahmad Faiz","Sotaro Kaneda","Ruhan Wang","Rita Osi","Prateek Sharma","Fan Chen","Lei Jiang"],"pdf_url":"https://arxiv.org/pdf/2309.14393v2.pdf","comment":"15 pages, 8 figures"},{"id":"http://arxiv.org/abs/2211.13350v2","updated":"2024-01-19T17:33:36Z","published":"2022-11-23T23:31:14Z","title":"Choreographer: Learning and Adapting Skills in Imagination","summary":" Unsupervised skill learning aims to learn a rich repertoire of behaviors\nwithout external supervision, providing artificial agents with the ability to\ncontrol and influence the environment. However, without appropriate knowledge\nand exploration, skills may provide control only over a restricted area of the\nenvironment, limiting their applicability. Furthermore, it is unclear how to\nleverage the learned skill behaviors for adapting to downstream tasks in a\ndata-efficient manner. We present Choreographer, a model-based agent that\nexploits its world model to learn and adapt skills in imagination. Our method\ndecouples the exploration and skill learning processes, being able to discover\nskills in the latent state space of the model. During adaptation, the agent\nuses a meta-controller to evaluate and adapt the learned skills efficiently by\ndeploying them in parallel in imagination. Choreographer is able to learn\nskills both from offline data, and by collecting data simultaneously with an\nexploration policy. The skills can be used to effectively adapt to downstream\ntasks, as we show in the URL benchmark, where we outperform previous approaches\nfrom both pixels and states inputs. The learned skills also explore the\nenvironment thoroughly, finding sparse rewards more frequently, as shown in\ngoal-reaching tasks from the DMC Suite and Meta-World. Website and code:\nhttps://skillchoreographer.github.io/\n","authors":["Pietro Mazzaglia","Tim Verbelen","Bart Dhoedt","Alexandre Lacoste","Sai Rajeswar"],"pdf_url":"https://arxiv.org/pdf/2211.13350v2.pdf","comment":"Accepted at ICLR 2023 (notable top 25%)"},{"id":"http://arxiv.org/abs/2401.10831v1","updated":"2024-01-19T17:27:21Z","published":"2024-01-19T17:27:21Z","title":"Understanding Video Transformers via Universal Concept Discovery","summary":" This paper studies the problem of concept-based interpretability of\ntransformer representations for videos. Concretely, we seek to explain the\ndecision-making process of video transformers based on high-level,\nspatiotemporal concepts that are automatically discovered. Prior research on\nconcept-based interpretability has concentrated solely on image-level tasks.\nComparatively, video models deal with the added temporal dimension, increasing\ncomplexity and posing challenges in identifying dynamic concepts over time. 
In\nthis work, we systematically address these challenges by introducing the first\nVideo Transformer Concept Discovery (VTCD) algorithm. To this end, we propose\nan efficient approach for unsupervised identification of units of video\ntransformer representations - concepts, and ranking their importance to the\noutput of a model. The resulting concepts are highly interpretable, revealing\nspatio-temporal reasoning mechanisms and object-centric representations in\nunstructured video models. Performing this analysis jointly over a diverse set\nof supervised and self-supervised representations, we discover that some of\nthese mechanisms are universal in video transformers. Finally, we demonstrate\nthat VTCD can be used to improve model performance for fine-grained tasks.\n","authors":["Matthew Kowal","Achal Dave","Rares Ambrus","Adrien Gaidon","Konstantinos G. Derpanis","Pavel Tokmakov"],"pdf_url":"https://arxiv.org/pdf/2401.10831v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10825v1","updated":"2024-01-19T17:21:05Z","published":"2024-01-19T17:21:05Z","title":"A survey on recent advances in named entity recognition","summary":" Named Entity Recognition seeks to extract substrings within a text that name\nreal-world objects and to determine their type (for example, whether they refer\nto persons or organizations). In this survey, we first present an overview of\nrecent popular approaches, but we also look at graph- and transformer-based\nmethods including Large Language Models (LLMs) that have not had much coverage\nin other surveys. Second, we focus on methods designed for datasets with scarce\nannotations. Third, we evaluate the performance of the main NER implementations\non a variety of datasets with differing characteristics (as regards their\ndomain, their size, and their number of classes). We thus provide a deep\ncomparison of algorithms that are never considered together. Our experiments\nshed some light on how the characteristics of datasets affect the behavior of\nthe methods that we compare.\n","authors":["Imed Keraghel","Stanislas Morbieu","Mohamed Nadif"],"pdf_url":"https://arxiv.org/pdf/2401.10825v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2310.12955v2","updated":"2024-01-19T17:12:23Z","published":"2023-10-19T17:54:39Z","title":"Towards Robust Offline Reinforcement Learning under Diverse Data\n Corruption","summary":" Offline reinforcement learning (RL) presents a promising approach for\nlearning reinforced policies from offline datasets without the need for costly\nor unsafe interactions with the environment. However, datasets collected by\nhumans in real-world environments are often noisy and may even be maliciously\ncorrupted, which can significantly degrade the performance of offline RL. In\nthis work, we first investigate the performance of current offline RL\nalgorithms under comprehensive data corruption, including states, actions,\nrewards, and dynamics. Our extensive experiments reveal that implicit\nQ-learning (IQL) demonstrates remarkable resilience to data corruption among\nvarious offline RL algorithms. Furthermore, we conduct both empirical and\ntheoretical analyses to understand IQL's robust performance, identifying its\nsupervised policy learning scheme as the key factor. Despite its relative\nrobustness, IQL still suffers from heavy-tail targets of Q functions under\ndynamics corruption. 
To tackle this challenge, we draw inspiration from robust\nstatistics to employ the Huber loss to handle the heavy-tailedness and utilize\nquantile estimators to balance penalization for corrupted data and learning\nstability. By incorporating these simple yet effective modifications into IQL,\nwe propose a more robust offline RL approach named Robust IQL (RIQL). Extensive\nexperiments demonstrate that RIQL exhibits highly robust performance when\nsubjected to diverse data corruption scenarios.\n","authors":["Rui Yang","Han Zhong","Jiawei Xu","Amy Zhang","Chongjie Zhang","Lei Han","Tong Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.12955v2.pdf","comment":"Accepted by ICLR 2024"},{"id":"http://arxiv.org/abs/2401.10819v1","updated":"2024-01-19T17:09:32Z","published":"2024-01-19T17:09:32Z","title":"Optimisation in Neurosymbolic Learning Systems","summary":" Neurosymbolic AI aims to integrate deep learning with symbolic AI. This\nintegration has many promises, such as decreasing the amount of data required\nto train a neural network, improving the explainability and interpretability of\nanswers given by models and verifying the correctness of trained systems. We\nstudy neurosymbolic learning, where we have both data and background knowledge\nexpressed using symbolic languages. How do we connect the symbolic and neural\ncomponents to communicate this knowledge? One option is fuzzy reasoning, which\nstudies degrees of truth. For example, being tall is not a binary concept.\nInstead, probabilistic reasoning studies the probability that something is true\nor will happen. Our first research question studies how different forms of\nfuzzy reasoning combine with learning. We find surprising results like a\nconnection to the Raven paradox stating we confirm \"ravens are black\" when we\nobserve a green apple. In this study, we did not use the background knowledge\nwhen we deployed our models after training. In our second research question, we\nstudied how to use background knowledge in deployed models. We developed a new\nneural network layer based on fuzzy reasoning. Probabilistic reasoning is a\nnatural fit for neural networks, which we usually train to be probabilistic.\nHowever, they are expensive to compute and do not scale well to large tasks. In\nour third research question, we study how to connect probabilistic reasoning\nwith neural networks by sampling to estimate averages, while in the final\nresearch question, we study scaling probabilistic neurosymbolic learning to\nmuch larger problems than before. Our insight is to train a neural network with\nsynthetic data to predict the result of probabilistic reasoning.\n","authors":["Emile van Krieken"],"pdf_url":"https://arxiv.org/pdf/2401.10819v1.pdf","comment":"PhD dissertation"},{"id":"http://arxiv.org/abs/2401.10816v1","updated":"2024-01-19T17:03:37Z","published":"2024-01-19T17:03:37Z","title":"Co-Pilot for Health: Personalized Algorithmic AI Nudging to Improve\n Health Outcomes","summary":" The ability to shape health behaviors of large populations automatically,\nacross wearable types and disease conditions at scale has tremendous potential\nto improve global health outcomes. We designed and implemented an AI driven\nplatform for digital algorithmic nudging, enabled by a Graph-Neural Network\n(GNN) based Recommendation System, and granular health behavior data from\nwearable fitness devices. 
Here we describe the efficacy results of this\nplatform with its capabilities of personalized and contextual nudging to\n$n=84,764$ individuals over a 12-week period in Singapore. We statistically\nvalidated that participants in the target group who received such AI optimized\ndaily nudges increased daily physical activity like step count by 6.17% ($p =\n3.09\\times10^{-4}$) and weekly minutes of Moderate to Vigorous Physical\nActivity (MVPA) by 7.61% ($p = 1.16\\times10^{-2}$), compared to matched\nparticipants in control group who did not receive any nudges. Further, such\nnudges were very well received, with a 13.1% of nudges sent being opened (open\nrate), and 11.7% of the opened nudges rated useful compared to 1.9% rated as\nnot useful thereby demonstrating significant improvement in population level\nengagement metrics.\n","authors":["Jodi Chiam","Aloysius Lim","Cheryl Nott","Nicholas Mark","Ankur Teredesai","Sunil Shinde"],"pdf_url":"https://arxiv.org/pdf/2401.10816v1.pdf","comment":"19 pages, 2 figures"},{"id":"http://arxiv.org/abs/2401.10811v1","updated":"2024-01-19T16:56:11Z","published":"2024-01-19T16:56:11Z","title":"Simulation Based Bayesian Optimization","summary":" Bayesian Optimization (BO) is a powerful method for optimizing black-box\nfunctions by combining prior knowledge with ongoing function evaluations. BO\nconstructs a probabilistic surrogate model of the objective function given the\ncovariates, which is in turn used to inform the selection of future evaluation\npoints through an acquisition function. For smooth continuous search spaces,\nGaussian Processes (GPs) are commonly used as the surrogate model as they offer\nanalytical access to posterior predictive distributions, thus facilitating the\ncomputation and optimization of acquisition functions. However, in complex\nscenarios involving optimizations over categorical or mixed covariate spaces,\nGPs may not be ideal.\n This paper introduces Simulation Based Bayesian Optimization (SBBO) as a\nnovel approach to optimizing acquisition functions that only requires\n\\emph{sampling-based} access to posterior predictive distributions. SBBO allows\nthe use of surrogate probabilistic models tailored for combinatorial spaces\nwith discrete variables. Any Bayesian model in which posterior inference is\ncarried out through Markov chain Monte Carlo can be selected as the surrogate\nmodel in SBBO. In applications involving combinatorial optimization, we\ndemonstrate empirically the effectiveness of SBBO method using various choices\nof surrogate models.\n","authors":["Roi Naveiro","Becky Tang"],"pdf_url":"https://arxiv.org/pdf/2401.10811v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10809v1","updated":"2024-01-19T16:52:53Z","published":"2024-01-19T16:52:53Z","title":"Neglected Hessian component explains mysteries in Sharpness\n regularization","summary":" Recent work has shown that methods like SAM which either explicitly or\nimplicitly penalize second order information can improve generalization in deep\nlearning. Seemingly similar methods like weight noise and gradient penalties\noften fail to provide such benefits. We show that these differences can be\nexplained by the structure of the Hessian of the loss. First, we show that a\ncommon decomposition of the Hessian can be quantitatively interpreted as\nseparating the feature exploitation from feature exploration. 
The feature\nexploration, which can be described by the Nonlinear Modeling Error matrix\n(NME), is commonly neglected in the literature since it vanishes at\ninterpolation. Our work shows that the NME is in fact important as it can\nexplain why gradient penalties are sensitive to the choice of activation\nfunction. Using this insight we design interventions to improve performance. We\nalso provide evidence that challenges the long held equivalence of weight noise\nand gradient penalties. This equivalence relies on the assumption that the NME\ncan be ignored, which we find does not hold for modern networks since they\ninvolve significant feature learning. We find that regularizing feature\nexploitation but not feature exploration yields performance similar to gradient\npenalties.\n","authors":["Yann N. Dauphin","Atish Agarwala","Hossein Mobahi"],"pdf_url":"https://arxiv.org/pdf/2401.10809v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.07626v3","updated":"2024-01-19T16:52:27Z","published":"2022-08-16T09:24:47Z","title":"Algorithmic Assistance with Recommendation-Dependent Preferences","summary":" When an algorithm provides risk assessments, we typically think of them as\nhelpful inputs to human decisions, such as when risk scores are presented to\njudges or doctors. However, a decision-maker may not only react to the\ninformation provided by the algorithm. The decision-maker may also view the\nalgorithmic recommendation as a default action, making it costly for them to\ndeviate, such as when a judge is reluctant to overrule a high-risk assessment\nfor a defendant or a doctor fears the consequences of deviating from\nrecommended procedures. To address such unintended consequences of algorithmic\nassistance, we propose a principal-agent model of joint human-machine\ndecision-making. Within this model, we consider the effect and design of\nalgorithmic recommendations when they affect choices not just by shifting\nbeliefs, but also by altering preferences. We motivate this assumption from\ninstitutional factors, such as a desire to avoid audits, as well as from\nwell-established models in behavioral science that predict loss aversion\nrelative to a reference point, which here is set by the algorithm. We show that\nrecommendation-dependent preferences create inefficiencies where the\ndecision-maker is overly responsive to the recommendation. As a potential\nremedy, we discuss algorithms that strategically withhold recommendations, and\nshow how they can improve the quality of final decisions.\n","authors":["Bryce McLaughlin","Jann Spiess"],"pdf_url":"https://arxiv.org/pdf/2208.07626v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10805v1","updated":"2024-01-19T16:48:49Z","published":"2024-01-19T16:48:49Z","title":"Learning to Visually Connect Actions and their Effects","summary":" In this work, we introduce the novel concept of visually Connecting Actions\nand Their Effects (CATE) in video understanding. CATE can have applications in\nareas like task planning and learning from demonstration. We propose different\nCATE-based task formulations, such as action selection and action\nspecification, where video understanding models connect actions and effects at\nsemantic and fine-grained levels. We observe that different formulations\nproduce representations capturing intuitive action properties. We also design\nvarious baseline models for action selection and action specification. 
Despite\nthe intuitive nature of the task, we observe that models struggle, and humans\noutperform them by a large margin. The study aims to establish a foundation for\nfuture efforts, showcasing the flexibility and versatility of connecting\nactions and effects in video understanding, with the hope of inspiring advanced\nformulations and models.\n","authors":["Eric Peh","Paritosh Parmar","Basura Fernando"],"pdf_url":"https://arxiv.org/pdf/2401.10805v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10800v1","updated":"2024-01-19T16:36:27Z","published":"2024-01-19T16:36:27Z","title":"Estimation of AMOC transition probabilities using a machine learning\n based rare-event algorithm","summary":" The Atlantic Meridional Overturning Circulation (AMOC) is an important\ncomponent of the global climate, known to be a tipping element, as it could\ncollapse under global warming. The main objective of this study is to compute\nthe probability that the AMOC collapses within a specified time window, using a\nrare-event algorithm called Trajectory-Adaptive Multilevel Splitting (TAMS).\nHowever, the efficiency and accuracy of TAMS depend on the choice of the score\nfunction. Although the definition of the optimal score function, called\n``committor function\" is known, it is impossible in general to compute it a\npriori. Here, we combine TAMS with a Next-Generation Reservoir Computing\ntechnique that estimates the committor function from the data generated by the\nrare-event algorithm. We test this technique in a stochastic box model of the\nAMOC for which two types of transition exist, the so-called F(ast)-transitions\nand S(low)-transitions. Results for the F-transtions compare favorably with\nthose in the literature where a physically-informed score function was used. We\nshow that coupling a rare-event algorithm with machine learning allows for a\ncorrect estimation of transition probabilities, transition times, and even\ntransition paths for a wide range of model parameters. We then extend these\nresults to the more difficult problem of S-transitions in the same model. In\nboth cases of F- and S-transitions, we also show how the Next-Generation\nReservoir Computing technique can be interpreted to retrieve an analytical\nestimate of the committor function.\n","authors":["Valérian Jacques-Dumas","René M. van Westen","Henk A. Dijkstra"],"pdf_url":"https://arxiv.org/pdf/2401.10800v1.pdf","comment":"16 pages, 9 figures"},{"id":"http://arxiv.org/abs/2401.10799v1","updated":"2024-01-19T16:34:37Z","published":"2024-01-19T16:34:37Z","title":"Novel Representation Learning Technique using Graphs for Performance\n Analytics","summary":" The performance analytics domain in High Performance Computing (HPC) uses\ntabular data to solve regression problems, such as predicting the execution\ntime. Existing Machine Learning (ML) techniques leverage the correlations among\nfeatures given tabular datasets, not leveraging the relationships between\nsamples directly. Moreover, since high-quality embeddings from raw features\nimprove the fidelity of the downstream predictive models, existing methods rely\non extensive feature engineering and pre-processing steps, costing time and\nmanual effort. To fill these two gaps, we propose a novel idea of transforming\ntabular performance data into graphs to leverage the advancement of Graph\nNeural Network-based (GNN) techniques in capturing complex relationships\nbetween features and samples. 
In contrast to other ML application domains, such\nas social networks, the graph is not given; instead, we need to build it. To\naddress this gap, we propose graph-building methods where nodes represent\nsamples, and the edges are automatically inferred iteratively based on the\nsimilarity between the features in the samples. We evaluate the effectiveness\nof the generated embeddings from GNNs based on how well they make even a simple\nfeed-forward neural network perform for regression tasks compared to other\nstate-of-the-art representation learning techniques. Our evaluation\ndemonstrates that even with up to 25% random missing values for each dataset,\nour method outperforms commonly used graph and Deep Neural Network (DNN)-based\napproaches and achieves up to 61.67% & 78.56% improvement in MSE loss over the\nDNN baseline respectively for HPC dataset and Machine Learning Datasets.\n","authors":["Tarek Ramadan","Ankur Lahiry","Tanzima Z. Islam"],"pdf_url":"https://arxiv.org/pdf/2401.10799v1.pdf","comment":"This paper has been accepted at 22nd International Conference on\n Machine Learning and Applications (ICMLA2023)"},{"id":"http://arxiv.org/abs/2201.05158v3","updated":"2024-01-19T16:26:46Z","published":"2022-01-13T16:35:45Z","title":"Towards Quantum Graph Neural Networks: An Ego-Graph Learning Approach","summary":" Quantum machine learning is a fast-emerging field that aims to tackle machine\nlearning using quantum algorithms and quantum computing. Due to the lack of\nphysical qubits and an effective means to map real-world data from Euclidean\nspace to Hilbert space, most of these methods focus on quantum analogies or\nprocess simulations rather than devising concrete architectures based on\nqubits. In this paper, we propose a novel hybrid quantum-classical algorithm\nfor graph-structured data, which we refer to as the Ego-graph based Quantum\nGraph Neural Network (egoQGNN). egoQGNN implements the GNN theoretical\nframework using the tensor product and unity matrix representation, which\ngreatly reduces the number of model parameters required. When controlled by a\nclassical computer, egoQGNN can accommodate arbitrarily sized graphs by\nprocessing ego-graphs from the input graph using a modestly-sized quantum\ndevice. The architecture is based on a novel mapping from real-world data to\nHilbert space. This mapping maintains the distance relations present in the\ndata and reduces information loss. Experimental results show that the proposed\nmethod outperforms competitive state-of-the-art models with only 1.68\\%\nparameters compared to those models.\n","authors":["Xing Ai","Zhihong Zhang","Luzhe Sun","Junchi Yan","Edwin Hancock"],"pdf_url":"https://arxiv.org/pdf/2201.05158v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10794v1","updated":"2024-01-19T16:26:35Z","published":"2024-01-19T16:26:35Z","title":"Deep Reinforcement Learning Empowered Activity-Aware Dynamic Health\n Monitoring Systems","summary":" In smart healthcare, health monitoring utilizes diverse tools and\ntechnologies to analyze patients' real-time biosignal data, enabling immediate\nactions and interventions. Existing monitoring approaches were designed on the\npremise that medical devices track several health metrics concurrently,\ntailored to their designated functional scope. This means that they report all\nrelevant health values within that scope, which can result in excess resource\nuse and the gathering of extraneous data due to monitoring irrelevant health\nmetrics. 
In this context, we propose Dynamic Activity-Aware Health Monitoring\nstrategy (DActAHM) for striking a balance between optimal monitoring\nperformance and cost efficiency, a novel framework based on Deep Reinforcement\nLearning (DRL) and SlowFast Model to ensure precise monitoring based on users'\nactivities. Specifically, with the SlowFast Model, DActAHM efficiently\nidentifies individual activities and captures these results for enhanced\nprocessing. Subsequently, DActAHM refines health metric monitoring in response\nto the identified activity by incorporating a DRL framework. Extensive\nexperiments comparing DActAHM against three state-of-the-art approaches\ndemonstrate it achieves 27.3% higher gain than the best-performing baseline\nthat fixes monitoring actions over timeline.\n","authors":["Ziqiaing Ye","Yulan Gao","Yue Xiao","Zehui Xiong","Dusit Niyato"],"pdf_url":"https://arxiv.org/pdf/2401.10794v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10791v1","updated":"2024-01-19T16:23:53Z","published":"2024-01-19T16:23:53Z","title":"Early alignment in two-layer networks training is a two-edged sword","summary":" Training neural networks with first order optimisation methods is at the core\nof the empirical success of deep learning. The scale of initialisation is a\ncrucial factor, as small initialisations are generally associated to a feature\nlearning regime, for which gradient descent is implicitly biased towards simple\nsolutions. This work provides a general and quantitative description of the\nearly alignment phase, originally introduced by Maennel et al. (2018) . For\nsmall initialisation and one hidden ReLU layer networks, the early stage of the\ntraining dynamics leads to an alignment of the neurons towards key directions.\nThis alignment induces a sparse representation of the network, which is\ndirectly related to the implicit bias of gradient flow at convergence. This\nsparsity inducing alignment however comes at the expense of difficulties in\nminimising the training objective: we also provide a simple data example for\nwhich overparameterised networks fail to converge towards global minima and\nonly converge to a spurious stationary point instead.\n","authors":["Etienne Boursier","Nicolas Flammarion"],"pdf_url":"https://arxiv.org/pdf/2401.10791v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10790v1","updated":"2024-01-19T16:21:55Z","published":"2024-01-19T16:21:55Z","title":"Measuring the Impact of Scene Level Objects on Object Detection: Towards\n Quantitative Explanations of Detection Decisions","summary":" Although accuracy and other common metrics can provide a useful window into\nthe performance of an object detection model, they lack a deeper view of the\nmodel's decision process. Regardless of the quality of the training data and\nprocess, the features that an object detection model learns cannot be\nguaranteed. A model may learn a relationship between certain background\ncontext, i.e., scene level objects, and the presence of the labeled classes.\nFurthermore, standard performance verification and metrics would not identify\nthis phenomenon. This paper presents a new black box explainability method for\nadditional verification of object detection models by finding the impact of\nscene level objects on the identification of the objects within the image. By\ncomparing the accuracies of a model on test data with and without certain scene\nlevel objects, the contributions of these objects to the model's performance\nbecomes clearer. 
The experiment presented here will assess the impact of\nbuildings and people in image context on the detection of emergency road\nvehicles by a fine-tuned YOLOv8 model. A large increase in accuracy in the\npresence of a scene level object will indicate the model's reliance on that\nobject to make its detections. The results of this research lead to providing a\nquantitative explanation of the object detection model's decision process,\nenabling a deeper understanding of the model's performance.\n","authors":["Lynn Vonder Haar","Timothy Elvira","Luke Newcomb","Omar Ochoa"],"pdf_url":"https://arxiv.org/pdf/2401.10790v1.pdf","comment":"9 pages, 4 figures, 1 table"},{"id":"http://arxiv.org/abs/2401.07961v2","updated":"2024-01-19T15:55:16Z","published":"2024-01-15T20:57:50Z","title":"Solution of the Probabilistic Lambert Problem: Connections with Optimal\n Mass Transport, Schrödinger Bridge and Reaction-Diffusion PDEs","summary":" Lambert's problem concerns with transferring a spacecraft from a given\ninitial to a given terminal position within prescribed flight time via velocity\ncontrol subject to a gravitational force field. We consider a probabilistic\nvariant of the Lambert problem where the knowledge of the endpoint constraints\nin position vectors are replaced by the knowledge of their respective joint\nprobability density functions. We show that the Lambert problem with endpoint\njoint probability density constraints is a generalized optimal mass transport\n(OMT) problem, thereby connecting this classical astrodynamics problem with a\nburgeoning area of research in modern stochastic control and stochastic machine\nlearning. This newfound connection allows us to rigorously establish the\nexistence and uniqueness of solution for the probabilistic Lambert problem. The\nsame connection also helps to numerically solve the probabilistic Lambert\nproblem via diffusion regularization, i.e., by leveraging further connection of\nthe OMT with the Schr\\\"odinger bridge problem (SBP). This also shows that the\nprobabilistic Lambert problem with additive dynamic process noise is in fact a\ngeneralized SBP, and can be solved numerically using the so-called\nSchr\\\"odinger factors, as we do in this work. We explain how the resulting\nanalysis leads to solving a boundary-coupled system of reaction-diffusion PDEs\nwhere the nonlinear gravitational potential appears as the reaction rate. We\npropose novel algorithms for the same, and present illustrative numerical\nresults. Our analysis and the algorithmic framework are nonparametric, i.e., we\nmake neither statistical (e.g., Gaussian, first few moments, mixture or\nexponential family, finite dimensionality of the sufficient statistic) nor\ndynamical (e.g., Taylor series) approximations.\n","authors":["Alexis M. H. 
Teter","Iman Nodozi","Abhishek Halder"],"pdf_url":"https://arxiv.org/pdf/2401.07961v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10774v1","updated":"2024-01-19T15:48:40Z","published":"2024-01-19T15:48:40Z","title":"Medusa: Simple LLM Inference Acceleration Framework with Multiple\n Decoding Heads","summary":" The inference process in Large Language Models (LLMs) is often limited due to\nthe absence of parallelism in the auto-regressive decoding process, resulting\nin most operations being restricted by the memory bandwidth of accelerators.\nWhile methods such as speculative decoding have been suggested to address this\nissue, their implementation is impeded by the challenges associated with\nacquiring and maintaining a separate draft model. In this paper, we present\nMedusa, an efficient method that augments LLM inference by adding extra\ndecoding heads to predict multiple subsequent tokens in parallel. Using a\ntree-based attention mechanism, Medusa constructs multiple candidate\ncontinuations and verifies them simultaneously in each decoding step. By\nleveraging parallel processing, Medusa introduces only minimal overhead in\nterms of single-step latency while substantially reducing the number of\ndecoding steps required.\n We present two levels of fine-tuning procedures for Medusa to meet the needs\nof different use cases: Medusa-1: Medusa is directly fine-tuned on top of a\nfrozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa\nis fine-tuned together with the backbone LLM, enabling better prediction\naccuracy of Medusa heads and higher speedup but needing a special training\nrecipe that preserves the backbone model's capabilities.\n Moreover, we propose several extensions that improve or expand the utility of\nMedusa, including a self-distillation to handle situations where no training\ndata is available and a typical acceptance scheme to boost the acceptance rate\nwhile maintaining generation quality. We evaluate Medusa on models of various\nsizes and training procedures. Our experiments demonstrate that Medusa-1 can\nachieve over 2.2x speedup without compromising generation quality, while\nMedusa-2 further improves the speedup to 2.3-3.6x.\n","authors":["Tianle Cai","Yuhong Li","Zhengyang Geng","Hongwu Peng","Jason D. Lee","Deming Chen","Tri Dao"],"pdf_url":"https://arxiv.org/pdf/2401.10774v1.pdf","comment":"The code for this implementation is available at\n https://github.com/FasterDecoding/Medusa"},{"id":"http://arxiv.org/abs/2401.10765v1","updated":"2024-01-19T15:37:11Z","published":"2024-01-19T15:37:11Z","title":"Starlit: Privacy-Preserving Federated Learning to Enhance Financial\n Fraud Detection","summary":" Federated Learning (FL) is a data-minimization approach enabling\ncollaborative model training across diverse clients with local data, avoiding\ndirect data exchange. However, state-of-the-art FL solutions to identify\nfraudulent financial transactions exhibit a subset of the following\nlimitations. 
They (1) lack a formal security definition and proof, (2) assume\nprior freezing of suspicious customers' accounts by financial institutions\n(limiting the solutions' adoption), (3) scale poorly, involving either $O(n^2)$\ncomputationally expensive modular exponentiation (where $n$ is the total number\nof financial institutions) or highly inefficient fully homomorphic encryption,\n(4) assume the parties have already completed the identity alignment phase,\nhence excluding it from the implementation, performance evaluation, and\nsecurity analysis, and (5) struggle to resist clients' dropouts. This work\nintroduces Starlit, a novel scalable privacy-preserving FL mechanism that\novercomes these limitations. It has various applications, such as enhancing\nfinancial fraud detection, mitigating terrorism, and enhancing digital health.\nWe implemented Starlit and conducted a thorough performance analysis using\nsynthetic data from a key player in global financial transactions. The\nevaluation indicates Starlit's scalability, efficiency, and accuracy.\n","authors":["Aydin Abadi","Bradley Doyle","Francesco Gini","Kieron Guinamard","Sasi Kumar Murakonda","Jack Liddell","Paul Mellor","Steven J. Murdoch","Mohammad Naseri","Hector Page","George Theodorakopoulos","Suzanne Weller"],"pdf_url":"https://arxiv.org/pdf/2401.10765v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.17046v2","updated":"2024-01-19T15:33:12Z","published":"2023-03-29T22:18:47Z","title":"Have it your way: Individualized Privacy Assignment for DP-SGD","summary":" When training a machine learning model with differential privacy, one sets a\nprivacy budget. This budget represents a maximal privacy violation that any\nuser is willing to face by contributing their data to the training set. We\nargue that this approach is limited because different users may have different\nprivacy expectations. Thus, setting a uniform privacy budget across all points\nmay be overly conservative for some users or, conversely, not sufficiently\nprotective for others. In this paper, we capture these preferences through\nindividualized privacy budgets. To demonstrate their practicality, we introduce\na variant of Differentially Private Stochastic Gradient Descent (DP-SGD) which\nsupports such individualized budgets. DP-SGD is the canonical approach to\ntraining models with differential privacy. We modify its data sampling and\ngradient noising mechanisms to arrive at our approach, which we call\nIndividualized DP-SGD (IDP-SGD). Because IDP-SGD provides privacy guarantees\ntailored to the preferences of individual users and their data points, we find\nit empirically improves privacy-utility trade-offs.\n","authors":["Franziska Boenisch","Christopher Mühl","Adam Dziedzic","Roy Rinberg","Nicolas Papernot"],"pdf_url":"https://arxiv.org/pdf/2303.17046v2.pdf","comment":"Published at NeurIPS'2024"},{"id":"http://arxiv.org/abs/2205.14102v3","updated":"2024-01-19T15:30:04Z","published":"2022-05-27T17:12:26Z","title":"Group-level Brain Decoding with Deep Learning","summary":" Decoding brain imaging data are gaining popularity, with applications in\nbrain-computer interfaces and the study of neural representations. Decoding is\ntypically subject-specific and does not generalise well over subjects, due to\nhigh amounts of between-subject variability. Techniques that overcome this will\nnot only provide richer neuroscientific insights but also make it possible for\ngroup-level models to outperform subject-specific models. Here, we propose a\nmethod that uses subject embedding, analogous to word embedding in natural\nlanguage processing, to learn and exploit the structure in between-subject\nvariability as part of a decoding model, our adaptation of the WaveNet\narchitecture for classification. We apply this to magnetoencephalography data,\nwhere 15 subjects viewed 118 different images, with 30 examples per image; to\nclassify images using the entire 1 s window following image presentation. We\nshow that the combination of deep learning and subject embedding is crucial to\nclosing the performance gap between subject- and group-level decoding models.\nImportantly, group models outperform subject models on low-accuracy subjects\n(although slightly impair high-accuracy subjects) and can be helpful for\ninitialising subject models. While we have not generally found\ngroup-level models to perform better than subject-level models, the performance\nof group modelling is expected to be even higher with bigger datasets. In order\nto provide physiological interpretation at the group level, we make use of\npermutation feature importance. This provides insights into the spatiotemporal\nand spectral information encoded in the models. All code is available on GitHub\n(https://github.com/ricsinaruto/MEG-group-decode).\n","authors":["Richard Csaky","Mats Van Es","Oiwi Parker Jones","Mark Woolrich"],"pdf_url":"https://arxiv.org/pdf/2205.14102v3.pdf","comment":"Published in Human Brain Mapping"},{"id":"http://arxiv.org/abs/2401.10754v1","updated":"2024-01-19T15:25:09Z","published":"2024-01-19T15:25:09Z","title":"Data Augmentation for Traffic Classification","summary":" Data Augmentation (DA) -- enriching training data by adding synthetic samples\n-- is a technique widely adopted in Computer Vision (CV) and Natural Language\nProcessing (NLP) tasks to improve models performance. Yet, DA has struggled to\ngain traction in networking contexts, particularly in Traffic Classification\n(TC) tasks. In this work, we fulfill this gap by benchmarking 18 augmentation\nfunctions applied to 3 TC datasets using packet time series as input\nrepresentation and considering a variety of training conditions. Our results\nshow that (i) DA can reap benefits previously unexplored with (ii)\naugmentations acting on time series sequence order and masking being a better\nsuit for TC and (iii) simple latent space analysis can provide hints about why\naugmentations have positive or negative effects.\n","authors":["Chao Wang","Alessandro Finamore","Pietro Michiardi","Massimo Gallo","Dario Rossi"],"pdf_url":"https://arxiv.org/pdf/2401.10754v1.pdf","comment":"to appear at Passive and Active Measurements (PAM), 2024"},{"id":"http://arxiv.org/abs/2401.10753v1","updated":"2024-01-19T15:22:28Z","published":"2024-01-19T15:22:28Z","title":"BoolGebra: Attributed Graph-learning for Boolean Algebraic Manipulation","summary":" Boolean algebraic manipulation is at the core of logic synthesis in\nElectronic Design Automation (EDA) design flow. Existing methods struggle to\nfully exploit optimization opportunities, and often suffer from an explosive\nsearch space and limited scalability efficiency. This work presents BoolGebra,\na novel attributed graph-learning approach for Boolean algebraic manipulation\nthat aims to improve fundamental logic synthesis. BoolGebra incorporates Graph\nNeural Networks (GNNs) and takes initial feature embeddings from both\nstructural and functional information as inputs. 
A fully connected neural\nnetwork is employed as the predictor for direct optimization result\npredictions, significantly reducing the search space and efficiently locating\nthe optimization space. The experiments involve training the BoolGebra model\nw.r.t design-specific and cross-design inferences using the trained model,\nwhere BoolGebra demonstrates generalizability for cross-design inference and\nits potential to scale from small, simple training datasets to large, complex\ninference datasets. Finally, BoolGebra is integrated with existing synthesis\ntool ABC to perform end-to-end logic minimization evaluation w.r.t SOTA\nbaselines.\n","authors":["Yingjie Li","Anthony Agnesina","Yanqing Zhang","Haoxing Ren","Cunxi Yu"],"pdf_url":"https://arxiv.org/pdf/2401.10753v1.pdf","comment":"DATE 2024 extended version. arXiv admin note: text overlap with\n arXiv:2310.07846"},{"id":"http://arxiv.org/abs/2310.13384v2","updated":"2024-01-19T15:19:54Z","published":"2023-10-20T09:53:55Z","title":"Salted Inference: Enhancing Privacy while Maintaining Efficiency of\n Split Inference in Mobile Computing","summary":" In split inference, a deep neural network (DNN) is partitioned to run the\nearly part of the DNN at the edge and the later part of the DNN in the cloud.\nThis meets two key requirements for on-device machine learning: input privacy\nand computation efficiency. Still, an open question in split inference is\noutput privacy, given that the outputs of the DNN are observable in the cloud.\nWhile encrypted computing can protect output privacy too, homomorphic\nencryption requires substantial computation and communication resources from\nboth edge and cloud devices. In this paper, we introduce Salted DNNs: a novel\napproach that enables clients at the edge, who run the early part of the DNN,\nto control the semantic interpretation of the DNN's outputs at inference time.\nOur proposed Salted DNNs maintain classification accuracy and computation\nefficiency very close to the standard DNN counterparts. Experimental\nevaluations conducted on both images and wearable sensor data demonstrate that\nSalted DNNs attain classification accuracy very close to standard DNNs,\nparticularly when the Salted Layer is positioned within the early part to meet\nthe requirements of split inference. Our approach is general and can be applied\nto various types of DNNs. As a benchmark for future studies, we open-source our\ncode.\n","authors":["Mohammad Malekzadeh","Fahim Kawsar"],"pdf_url":"https://arxiv.org/pdf/2310.13384v2.pdf","comment":"To be appeared in the 25th International Workshop on Mobile Computing\n Systems and Applications (HotMobile 2024)"},{"id":"http://arxiv.org/abs/2305.03077v2","updated":"2024-01-19T15:16:37Z","published":"2023-05-04T18:00:01Z","title":"Explaining dark matter halo density profiles with neural networks","summary":" We use explainable neural networks to connect the evolutionary history of\ndark matter halos with their density profiles. The network captures independent\nfactors of variation in the density profiles within a low-dimensional\nrepresentation, which we physically interpret using mutual information. Without\nany prior knowledge of the halos' evolution, the network recovers the known\nrelation between the early time assembly and the inner profile, and discovers\nthat the profile beyond the virial radius is described by a single parameter\ncapturing the most recent mass accretion rate. 
The results illustrate the\npotential for machine-assisted scientific discovery in complicated\nastrophysical datasets.\n","authors":["Luisa Lucie-Smith","Hiranya V. Peiris","Andrew Pontzen"],"pdf_url":"https://arxiv.org/pdf/2305.03077v2.pdf","comment":"7 pages, 5 figures. Minor changes to match version accepted for\n publication in PRL"},{"id":"http://arxiv.org/abs/2401.10746v1","updated":"2024-01-19T15:13:30Z","published":"2024-01-19T15:13:30Z","title":"A Systematic Evaluation of Euclidean Alignment with Deep Learning for\n EEG Decoding","summary":" Electroencephalography (EEG) signals are frequently used for various\nBrain-Computer Interface (BCI) tasks. While Deep Learning (DL) techniques have\nshown promising results, they are hindered by the substantial data\nrequirements. By leveraging data from multiple subjects, transfer learning\nenables more effective training of DL models. A technique that is gaining\npopularity is Euclidean Alignment (EA) due to its ease of use, low\ncomputational complexity, and compatibility with Deep Learning models. However,\nfew studies evaluate its impact on the training performance of shared and\nindividual DL models. In this work, we systematically evaluate the effect of EA\ncombined with DL for decoding BCI signals. We used EA to train shared models\nwith data from multiple subjects and evaluated its transferability to new\nsubjects. Our experimental results show that it improves decoding in the target\nsubject by 4.33% and decreases convergence time by more than 70%. We also\ntrained individual models for each subject to use as a majority-voting ensemble\nclassifier. In this scenario, using EA improved the 3-model ensemble accuracy\nby 3.7%. However, when compared to the shared model with EA, the ensemble\naccuracy was 3.62% lower.\n","authors":["Bruna Junqueira","Bruno Aristimunha","Sylvain Chevallier","Raphael Y. de Camargo"],"pdf_url":"https://arxiv.org/pdf/2401.10746v1.pdf","comment":"14 pages and 10 figures"},{"id":"http://arxiv.org/abs/2401.09796v2","updated":"2024-01-19T15:09:45Z","published":"2024-01-18T08:33:09Z","title":"A Fast, Performant, Secure Distributed Training Framework For Large\n Language Model","summary":" The distributed (federated) LLM is an important method for co-training the\ndomain-specific LLM using siloed data. However, maliciously stealing model\nparameters and data from the server or client side has become an urgent problem\nto be solved. In this paper, we propose a secure distributed LLM based on model\nslicing. In this case, we deploy the Trusted Execution Environment (TEE) on\nboth the client and server side, and put the fine-tuned structure (LoRA or\nembedding of P-tuning v2) into the TEE. Then, secure communication is executed\nin the TEE and general environments through lightweight encryption. In order to\nfurther reduce the equipment cost as well as increase the model performance and\naccuracy, we propose a split fine-tuning scheme. In particular, we split the\nLLM by layers and place the latter layers in a server-side TEE (the client does\nnot need a TEE). We then combine the proposed Sparsification Parameter\nFine-tuning (SPF) with the LoRA part to improve the accuracy of the downstream\ntask. 
Numerous experiments have shown that our method guarantees accuracy while\nmaintaining security.\n","authors":["Wei Huang","Yinggui Wang","Anda Cheng","Aihui Zhou","Chaofan Yu","Lei Wang"],"pdf_url":"https://arxiv.org/pdf/2401.09796v2.pdf","comment":"Accepted by ICASSP 2024 (Federated LLM)"},{"id":"http://arxiv.org/abs/2306.17248v2","updated":"2024-01-19T15:01:52Z","published":"2023-06-29T18:34:37Z","title":"TemperatureGAN: Generative Modeling of Regional Atmospheric Temperatures","summary":" Stochastic generators are useful for estimating climate impacts on various\nsectors. Projecting climate risk in various sectors, e.g. energy systems,\nrequires generators that are accurate (statistical resemblance to\nground-truth), reliable (do not produce erroneous examples), and efficient.\nLeveraging data from the North American Land Data Assimilation System, we\nintroduce TemperatureGAN, a Generative Adversarial Network conditioned on\nmonths, locations, and time periods, to generate 2m above ground atmospheric\ntemperatures at an hourly resolution. We propose evaluation methods and metrics\nto measure the quality of generated samples. We show that TemperatureGAN\nproduces high-fidelity examples with good spatial representation and temporal\ndynamics consistent with known diurnal cycles.\n","authors":["Emmanuel Balogun","Ram Rajagopal","Arun Majumdar"],"pdf_url":"https://arxiv.org/pdf/2306.17248v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.09234v2","updated":"2024-01-19T14:57:06Z","published":"2023-12-14T18:57:16Z","title":"Let's do the time-warp-attend: Learning topological invariants of\n dynamical systems","summary":" Dynamical systems across the sciences, from electrical circuits to ecological\nnetworks, undergo qualitative and often catastrophic changes in behavior,\ncalled bifurcations, when their underlying parameters cross a threshold.\nExisting methods predict oncoming catastrophes in individual systems but are\nprimarily time-series-based and struggle both to categorize qualitative\ndynamical regimes across diverse systems and to generalize to real data. To\naddress this challenge, we propose a data-driven, physically-informed\ndeep-learning framework for classifying dynamical regimes and characterizing\nbifurcation boundaries based on the extraction of topologically invariant\nfeatures. We focus on the paradigmatic case of the supercritical Hopf\nbifurcation, which is used to model periodic dynamics across a wide range of\napplications. Our convolutional attention method is trained with data\naugmentations that encourage the learning of topological invariants which can\nbe used to detect bifurcation boundaries in unseen systems and to design models\nof biological systems like oscillatory gene regulatory networks. We further\ndemonstrate our method's use in analyzing real data by recovering distinct\nproliferation and differentiation dynamics along pancreatic endocrinogenesis\ntrajectory in gene expression space based on single-cell data. 
Our method\nprovides valuable insights into the qualitative, long-term behavior of a wide\nrange of dynamical systems, and can detect bifurcations or catastrophic\ntransitions in large-scale physical and biological systems.\n","authors":["Noa Moriel","Matthew Ricci","Mor Nitzan"],"pdf_url":"https://arxiv.org/pdf/2312.09234v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.02901v2","updated":"2024-01-19T14:53:51Z","published":"2023-03-06T05:35:32Z","title":"$α$-divergence Improves the Entropy Production Estimation via\n Machine Learning","summary":" Recent years have seen a surge of interest in the algorithmic estimation of\nstochastic entropy production (EP) from trajectory data via machine learning. A\ncrucial element of such algorithms is the identification of a loss function\nwhose minimization guarantees the accurate EP estimation. In this study, we\nshow that there exists a host of loss functions, namely those implementing a\nvariational representation of the $\\alpha$-divergence, which can be used for\nthe EP estimation. By fixing $\\alpha$ to a value between $-1$ and $0$, the\n$\\alpha$-NEEP (Neural Estimator for Entropy Production) exhibits a much more\nrobust performance against strong nonequilibrium driving or slow dynamics,\nwhich adversely affects the existing method based on the Kullback-Leibler\ndivergence ($\\alpha = 0$). In particular, the choice of $\\alpha = -0.5$ tends\nto yield the optimal results. To corroborate our findings, we present an\nexactly solvable simplification of the EP estimation problem, whose loss\nfunction landscape and stochastic properties give deeper intuition into the\nrobustness of the $\\alpha$-NEEP.\n","authors":["Euijoon Kwon","Yongjoo Baek"],"pdf_url":"https://arxiv.org/pdf/2303.02901v2.pdf","comment":"11 pages, 9 figures"},{"id":"http://arxiv.org/abs/2401.10726v1","updated":"2024-01-19T14:43:04Z","published":"2024-01-19T14:43:04Z","title":"Empowering Aggregators with Practical Data-Driven Tools: Harnessing\n Aggregated and Disaggregated Flexibility for Demand Response","summary":" This study explores the crucial interplay between aggregators and building\noccupants in activating flexibility through Demand Response (DR) programs, with\na keen focus on achieving robust decarbonization and fortifying the resilience\nof the energy system amidst the uncertainties presented by Renewable Energy\nSources (RES). Firstly, it introduces a methodology of optimizing aggregated\nflexibility provision strategies in environments with limited data, utilizing\nDiscrete Fourier Transformation (DFT) and clustering techniques to identify\nbuilding occupant's activity patterns. Secondly, the study assesses the\ndisaggregated flexibility provision of Heating Ventilation and Air Conditioning\n(HVAC) systems during DR events, employing machine learning and optimization\ntechniques for precise, device-level analysis. The first approach offers a\nnon-intrusive pathway for aggregators to provide flexibility services in\nenvironments of a single smart meter for the whole building's consumption,\nwhile the second approach carefully considers building occupants' thermal\ncomfort profiles, while maximizing flexibility in case of existence of\ndedicated smart meters to the HVAC systems. 
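Editor's sketch: the aggregator entry above describes a first approach that characterizes occupant activity patterns from a single smart meter using a Discrete Fourier Transform followed by clustering. The snippet below is a loose illustration of that pipeline on invented hourly profiles; the number of clusters and retained frequency components are assumptions, not values from the paper.

```python
# Cluster daily consumption profiles by their dominant frequency components.
# `daily_profiles` stands in for hourly smart-meter readings, shape (n_days, 24).
import numpy as np
from sklearn.cluster import KMeans

daily_profiles = np.random.rand(365, 24)                 # toy stand-in data
spectra = np.abs(np.fft.rfft(daily_profiles, axis=1))    # DFT magnitude per day
features = spectra[:, :6]                                # keep low-frequency terms
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
```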
Through the application of\ndata-driven techniques and encompassing case studies from both industrial and\nresidential buildings, this paper not only unveils pivotal opportunities for\naggregators in the balancing and emerging flexibility markets but also\nsuccessfully develops end-to-end practical tools for aggregators. Furthermore,\nthe efficacy of this tool is validated through detailed case studies,\nsubstantiating its operational capability and contributing to the evolution of\na resilient and efficient energy system.\n","authors":["Costas Mylonas","Donata Boric","Leila Luttenberger Maric","Alexandros Tsitsanis","Eleftheria Petrianou","Magda Foti"],"pdf_url":"https://arxiv.org/pdf/2401.10726v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10724v1","updated":"2024-01-19T14:36:01Z","published":"2024-01-19T14:36:01Z","title":"Real-Time Zero-Day Intrusion Detection System for Automotive Controller\n Area Network on FPGAs","summary":" Increasing automation in vehicles enabled by increased connectivity to the\noutside world has exposed vulnerabilities in previously siloed automotive\nnetworks like controller area networks (CAN). Attributes of CAN such as\nbroadcast-based communication among electronic control units (ECUs) that\nlowered deployment costs are now being exploited to carry out active injection\nattacks like denial of service (DoS), fuzzing, and spoofing attacks. Research\nliterature has proposed multiple supervised machine learning models deployed as\nIntrusion detection systems (IDSs) to detect such malicious activity; however,\nthese are largely limited to identifying previously known attack vectors. With\nthe ever-increasing complexity of active injection attacks, detecting zero-day\n(novel) attacks in these networks in real-time (to prevent propagation) becomes\na problem of particular interest. This paper presents an\nunsupervised-learning-based convolutional autoencoder architecture for\ndetecting zero-day attacks, which is trained only on benign (attack-free) CAN\nmessages. We quantise the model using Vitis-AI tools from AMD/Xilinx targeting\na resource-constrained Zynq Ultrascale platform as our IDS-ECU system for\nintegration. The proposed model successfully achieves equal or higher\nclassification accuracy (> 99.5%) on unseen DoS, fuzzing, and spoofing attacks\nfrom a publicly available attack dataset when compared to the state-of-the-art\nunsupervised learning-based IDSs. Additionally, by cleverly overlapping IDS\noperation on a window of CAN messages with the reception, the model is able to\nmeet line-rate detection (0.43 ms per window) of high-speed CAN, which when\ncoupled with the low energy consumption per inference, makes this architecture\nideally suited for detecting zero-day attacks on critical CAN networks.\n","authors":["Shashwat Khandelwal","Shreejith Shanker"],"pdf_url":"https://arxiv.org/pdf/2401.10724v1.pdf","comment":"8 pages, 6 figures, 7 tables"},{"id":"http://arxiv.org/abs/2311.03976v2","updated":"2024-01-19T14:34:47Z","published":"2023-11-07T13:24:01Z","title":"A Foundation Graph Model","summary":" The principal benefit of unsupervised graph representation learning is that a\npre-trained model can be fine-tuned where data or labels are scarce. Existing\napproaches are domain specific, maintaining consistent node and edge attributes\nacross the pre-training and target datasets. This precludes transfer to other\ndomains. 
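Editor's sketch: the zero-day CAN IDS entry above trains a convolutional autoencoder only on benign traffic and flags windows whose reconstruction error is unusually high. The following is a generic, hypothetical version of that recipe; the 16x16 window encoding, layer sizes, and three-sigma threshold are illustrative assumptions rather than the paper's quantized FPGA model.

```python
# Convolutional autoencoder trained on benign CAN windows; high reconstruction
# error at run time flags a (possibly zero-day) attack.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),   # 16x16 -> 8x8
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU())  # 8x8 -> 4x4
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 8, 2, stride=2), nn.ReLU(),    # 4x4 -> 8x8
            nn.ConvTranspose2d(8, 1, 2, stride=2), nn.Sigmoid())  # 8x8 -> 16x16

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
benign = torch.rand(256, 1, 16, 16)       # stand-in for benign message windows

for _ in range(5):                         # train on attack-free traffic only
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(benign), benign)
    loss.backward()
    opt.step()

with torch.no_grad():
    errors = ((model(benign) - benign) ** 2).mean(dim=(1, 2, 3))
threshold = errors.mean() + 3 * errors.std()   # benign-derived detection threshold
```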
A model capable of positive transfer on arbitrary tasks and domains\nwould represent the first foundation graph model.\n In this work we use adversarial contrastive learning to present FoToM, a\ngraph pre-training method based on node and edge feature exclusion. We use\nFoToM to pre-train models over multiple graph domains, producing the first\nfoundation graph models. We demonstrate positive transfer on evaluation\ndatasets from multiple domains, including domains not present in pre-training\ndata. On all datasets performance is at worst on-par and on 76% significantly\nbetter than a supervised baseline ($P \\leq 0.01$), with an 8 to 40% reduction\nin error at 95% confidence. Contrary to other research, pre-training on a\ndataset with the target domain excluded leads us to better performance than\npre-training on a dataset from only the target domain. The multi-domain model\nat worst, matches, and on 56% of tasks, significantly outperforms single-domain\n($P \\leq 0.01$). These results include when node labels are used in evaluation,\nwhere performance is consistently superior to single-domain or non-pre-trained\nmodels. Notably, FoToM benefits scenarios in both large or scarce data regimes\nfor the target domains.\n","authors":["Alex O. Davies","Riku W. Green","Nirav S. Ajmeri","Telmo M. Silva Filho"],"pdf_url":"https://arxiv.org/pdf/2311.03976v2.pdf","comment":"Presented at the NeurIPS 2023 New Frontiers in Graph Learning\n workshop"},{"id":"http://arxiv.org/abs/2401.10721v1","updated":"2024-01-19T14:32:50Z","published":"2024-01-19T14:32:50Z","title":"Generative Model for Constructing Reaction Path from Initial to Final\n States","summary":" Mapping out reaction pathways and their corresponding activation barriers is\na significant aspect of molecular simulation. Given their inherent complexity\nand nonlinearity, even generating a initial guess of these paths remains a\nchallenging problem. Presented in this paper is an innovative approach that\nutilizes neural networks to generate initial guess for these reaction pathways.\nThe proposed method is initiated by inputting the coordinates of the initial\nstate, followed by progressive alterations to its structure. This iterative\nprocess culminates in the generation of the approximate representation of the\nreaction path and the coordinates of the final state. The application of this\nmethod extends to complex reaction pathways illustrated by organic reactions.\nTraining was executed on the Transition1x dataset, an organic reaction pathway\ndataset. The results revealed generation of reactions that bore substantial\nsimilarities with the corresponding test data. The method's flexibility allows\nfor reactions to be generated either to conform to predetermined conditions or\nin a randomized manner.\n","authors":["Akihide Hayashi","So Takamoto","Ju Li","Daisuke Okanohara"],"pdf_url":"https://arxiv.org/pdf/2401.10721v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10710v1","updated":"2024-01-19T14:18:32Z","published":"2024-01-19T14:18:32Z","title":"Classification with neural networks with quadratic decision functions","summary":" Neural network with quadratic decision functions have been introduced as\nalternatives to standard neural networks with affine linear one. They are\nadvantageous when the objects to be identified are of compact basic geometries\nlike circles, ellipsis etc. In this paper we investigate the use of such ansatz\nfunctions for classification. 
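Editor's sketch: the entry above studies networks whose decision functions are quadratic rather than affine. A compact PyTorch illustration of such an output layer is given below, scoring each class with x^T A_k x + b_k^T x + c_k; it is a generic sketch of the idea, not the authors' TensorFlow/Keras implementation.

```python
# A "quadratic decision function" output layer: one quadratic form per class.
import torch
import torch.nn as nn

class QuadraticDecision(nn.Module):
    def __init__(self, in_features, num_classes):
        super().__init__()
        self.A = nn.Parameter(torch.randn(num_classes, in_features, in_features) * 0.01)
        self.b = nn.Parameter(torch.zeros(num_classes, in_features))
        self.c = nn.Parameter(torch.zeros(num_classes))

    def forward(self, x):                                  # x: (batch, in_features)
        quad = torch.einsum('bi,kij,bj->bk', x, self.A, x) # x^T A_k x per class
        lin = x @ self.b.T                                 # b_k^T x per class
        return quad + lin + self.c

layer = QuadraticDecision(in_features=784, num_classes=10)
scores = layer(torch.randn(32, 784))                       # (32, 10) class scores
```

Because each class score is a quadratic form, level sets are ellipsoids or other conics, which is why such layers suit compact geometric classes like circles and ellipses.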
In particular we test and compare the algorithm\non the MNIST dataset for classification of handwritten digits and for\nclassification of subspecies. We also show, that the implementation can be\nbased on the neural network structure in the software Tensorflow and Keras,\nrespectively.\n","authors":["Leon Frischauf","Otmar Scherzer","Cong Shi"],"pdf_url":"https://arxiv.org/pdf/2401.10710v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.15591v2","updated":"2024-01-19T14:08:23Z","published":"2023-12-25T02:32:05Z","title":"Privacy-Preserving Neural Graph Databases","summary":" In the era of big data and rapidly evolving information systems, efficient\nand accurate data retrieval has become increasingly crucial. Neural graph\ndatabases (NGDBs) have emerged as a powerful paradigm that combines the\nstrengths of graph databases (graph DBs) and neural networks to enable\nefficient storage, retrieval, and analysis of graph-structured data. The usage\nof neural embedding storage and complex neural logical query answering provides\nNGDBs with generalization ability. When the graph is incomplete, by extracting\nlatent patterns and representations, neural graph databases can fill gaps in\nthe graph structure, revealing hidden relationships and enabling accurate query\nanswering. Nevertheless, this capability comes with inherent trade-offs, as it\nintroduces additional privacy risks to the database. Malicious attackers can\ninfer more sensitive information in the database using well-designed\ncombinatorial queries, such as by comparing the answer sets of where Turing\nAward winners born before 1950 and after 1940 lived, the living places of\nTuring Award winner Hinton are probably exposed, although the living places may\nhave been deleted in the training due to the privacy concerns. In this work,\ninspired by the privacy protection in graph embeddings, we propose a\nprivacy-preserving neural graph database (P-NGDB) to alleviate the risks of\nprivacy leakage in NGDBs. We introduce adversarial training techniques in the\ntraining stage to force the NGDBs to generate indistinguishable answers when\nqueried with private information, enhancing the difficulty of inferring\nsensitive information through combinations of multiple innocuous queries.\nExtensive experiment results on three datasets show that P-NGDB can effectively\nprotect private information in the graph database while delivering high-quality\npublic answers responses to queries.\n","authors":["Qi Hu","Haoran Li","Jiaxin Bai","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2312.15591v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10700v1","updated":"2024-01-19T14:05:09Z","published":"2024-01-19T14:05:09Z","title":"Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion\n Model","summary":" Safe offline RL is a promising way to bypass risky online interactions\ntowards safe policy learning. Most existing methods only enforce soft\nconstraints, i.e., constraining safety violations in expectation below\nthresholds predetermined. This can lead to potentially unsafe outcomes, thus\nunacceptable in safety-critical scenarios. An alternative is to enforce the\nhard constraint of zero violation. However, this can be challenging in offline\nsetting, as it needs to strike the right balance among three highly intricate\nand correlated aspects: safety constraint satisfaction, reward maximization,\nand behavior regularization imposed by offline datasets. 
Interestingly, we\ndiscover that via reachability analysis of safe-control theory, the hard safety\nconstraint can be equivalently translated to identifying the largest feasible\nregion given the offline dataset. This seamlessly converts the original trilogy\nproblem to a feasibility-dependent objective, i.e., maximizing reward value\nwithin the feasible region while minimizing safety risks in the infeasible\nregion. Inspired by these, we propose FISOR (FeasIbility-guided Safe Offline\nRL), which allows safety constraint adherence, reward maximization, and offline\npolicy learning to be realized via three decoupled processes, while offering\nstrong safety performance and stability. In FISOR, the optimal policy for the\ntranslated optimization problem can be derived in a special form of weighted\nbehavior cloning. Thus, we propose a novel energy-guided diffusion model that\ndoes not require training a complicated time-dependent classifier to extract\nthe policy, greatly simplifying the training. We compare FISOR against\nbaselines on DSRL benchmark for safe offline RL. Evaluation results show that\nFISOR is the only method that can guarantee safety satisfaction in all tasks,\nwhile achieving top returns in most tasks.\n","authors":["Yinan Zheng","Jianxiong Li","Dongjie Yu","Yujie Yang","Shengbo Eben Li","Xianyuan Zhan","Jingjing Liu"],"pdf_url":"https://arxiv.org/pdf/2401.10700v1.pdf","comment":"ICLR 2024, 30pages, 11 figures"},{"id":"http://arxiv.org/abs/2401.09902v2","updated":"2024-01-19T14:04:22Z","published":"2024-01-18T11:32:50Z","title":"Interplay between depth and width for interpolation in neural ODEs","summary":" Neural ordinary differential equations (neural ODEs) have emerged as a\nnatural tool for supervised learning from a control perspective, yet a complete\nunderstanding of their optimal architecture remains elusive. In this work, we\nexamine the interplay between their width $p$ and number of layer transitions\n$L$ (effectively the depth $L+1$). Specifically, we assess the model\nexpressivity in terms of its capacity to interpolate either a finite dataset\n$D$ comprising $N$ pairs of points or two probability measures in\n$\\mathbb{R}^d$ within a Wasserstein error margin $\\varepsilon>0$. Our findings\nreveal a balancing trade-off between $p$ and $L$, with $L$ scaling as\n$O(1+N/p)$ for dataset interpolation, and\n$L=O\\left(1+(p\\varepsilon^d)^{-1}\\right)$ for measure interpolation.\n In the autonomous case, where $L=0$, a separate study is required, which we\nundertake focusing on dataset interpolation. We address the relaxed problem of\n$\\varepsilon$-approximate controllability and establish an error decay of\n$\\varepsilon\\sim O(\\log(p)p^{-1/d})$. This decay rate is a consequence of\napplying a universal approximation theorem to a custom-built Lipschitz vector\nfield that interpolates $D$. In the high-dimensional setting, we further\ndemonstrate that $p=O(N)$ neurons are likely sufficient to achieve exact\ncontrol.\n","authors":["Antonio Álvarez-López","Arselane Hadj Slimane","Enrique Zuazua"],"pdf_url":"https://arxiv.org/pdf/2401.09902v2.pdf","comment":"16 pages, 10 figures, double column"},{"id":"http://arxiv.org/abs/2401.10690v1","updated":"2024-01-19T13:41:08Z","published":"2024-01-19T13:41:08Z","title":"Beyond RMSE and MAE: Introducing EAUC to unmask hidden bias and\n unfairness in dyadic regression models","summary":" Dyadic regression models, which predict real-valued outcomes for pairs of\nentities, are fundamental in many domains (e.g. 
predicting the rating of a user\nto a product in Recommender Systems) and promising and under exploration in\nmany others (e.g. approximating the adequate dosage of a drug for a patient in\npersonalized pharmacology). In this work, we demonstrate that non-uniformity in\nthe observed value distributions of individual entities leads to severely\nbiased predictions in state-of-the-art models, skewing predictions towards the\naverage of observed past values for the entity and providing worse-than-random\npredictive power in eccentric yet equally important cases. We show that the\nusage of global error metrics like Root Mean Squared Error (RMSE) and Mean\nAbsolute Error (MAE) is insufficient to capture this phenomenon, which we name\neccentricity bias, and we introduce Eccentricity-Area Under the Curve (EAUC) as\na new complementary metric that can quantify it in all studied models and\ndatasets. We also prove the adequateness of EAUC by using naive de-biasing\ncorrections to demonstrate that a lower model bias correlates with a lower EAUC\nand vice-versa. This work contributes a bias-aware evaluation of dyadic\nregression models to avoid potential unfairness and risks in critical\nreal-world applications of such systems.\n","authors":["Jorge Paz-Ruza","Amparo Alonso-Betanzos","Bertha Guijarro-Berdiñas","Brais Cancela","Carlos Eiras-Franco"],"pdf_url":"https://arxiv.org/pdf/2401.10690v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10689v1","updated":"2024-01-19T13:39:05Z","published":"2024-01-19T13:39:05Z","title":"A Lightweight Multi-Attack CAN Intrusion Detection System on Hybrid\n FPGAs","summary":" Rising connectivity in vehicles is enabling new capabilities like connected\nautonomous driving and advanced driver assistance systems (ADAS) for improving\nthe safety and reliability of next-generation vehicles. This increased access\nto in-vehicle functions compromises critical capabilities that use legacy\ninvehicle networks like Controller Area Network (CAN), which has no inherent\nsecurity or authentication mechanism. Intrusion detection and mitigation\napproaches, particularly using machine learning models, have shown promising\nresults in detecting multiple attack vectors in CAN through their ability to\ngeneralise to new vectors. However, most deployments require dedicated\ncomputing units like GPUs to perform line-rate detection, consuming much higher\npower. In this paper, we present a lightweight multi-attack quantised machine\nlearning model that is deployed using Xilinx's Deep Learning Processing Unit IP\non a Zynq Ultrascale+ (XCZU3EG) FPGA, which is trained and validated using the\npublic CAN Intrusion Detection dataset. The quantised model detects denial of\nservice and fuzzing attacks with an accuracy of above 99 % and a false positive\nrate of 0.07%, which are comparable to the state-of-the-art techniques in the\nliterature. The Intrusion Detection System (IDS) execution consumes just 2.0 W\nwith software tasks running on the ECU and achieves a 25 % reduction in\nper-message processing latency over the state-of-the-art implementations. 
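Editor's sketch: the dyadic-regression entry above argues that a model which regresses toward each entity's average value looks fine under global RMSE/MAE yet fails on eccentric observations. The toy experiment below illustrates that effect by binning errors of a mean-reverting predictor by eccentricity; it is only a didactic illustration and not the paper's EAUC metric, and all numbers are invented.

```python
# Toy illustration of eccentricity bias: a predictor that always returns the
# entity's average value has errors that grow with |y_true - entity mean|.
import numpy as np

rng = np.random.default_rng(0)
entity_means = rng.uniform(1, 5, size=100)            # e.g. average rating per user
entity_ids = rng.integers(0, 100, size=5000)
y_true = np.clip(rng.normal(entity_means[entity_ids], 1.0), 1, 5)

y_pred = entity_means[entity_ids]                      # mean-reverting predictor

eccentricity = np.abs(y_true - entity_means[entity_ids])
abs_error = np.abs(y_true - y_pred)

bins = np.digitize(eccentricity, [0.5, 1.0, 1.5, 2.0])
for b in range(bins.max() + 1):
    mask = bins == b
    if mask.any():
        print(f"eccentricity bin {b}: mean |error| = {abs_error[mask].mean():.2f} "
              f"over {mask.sum()} samples")
```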
This\ndeployment allows the ECU function to coexist with the IDS with minimal changes\nto the tasks, making it ideal for real-time IDS in in-vehicle systems.\n","authors":["Shashwat Khandelwal","Shreejith Shanker"],"pdf_url":"https://arxiv.org/pdf/2401.10689v1.pdf","comment":"5 pages, 2 figures, 6 tables"},{"id":"http://arxiv.org/abs/2401.10686v1","updated":"2024-01-19T13:33:23Z","published":"2024-01-19T13:33:23Z","title":"Manipulating Sparse Double Descent","summary":" This paper investigates the double descent phenomenon in two-layer neural\nnetworks, focusing on the role of L1 regularization and representation\ndimensions. It explores an alternative double descent phenomenon, named sparse\ndouble descent. The study emphasizes the complex relationship between model\ncomplexity, sparsity, and generalization, and suggests further research into\nmore diverse models and datasets. The findings contribute to a deeper\nunderstanding of neural network training and optimization.\n","authors":["Ya Shi Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.10686v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10685v1","updated":"2024-01-19T13:32:55Z","published":"2024-01-19T13:32:55Z","title":"Towards End-to-End GPS Localization with Neural Pseudorange Correction","summary":" Pseudorange errors are the root cause of localization inaccuracy in GPS.\nPrevious data-driven methods regress and eliminate pseudorange errors using\nhandcrafted intermediate labels. Unlike them, we propose an end-to-end GPS\nlocalization framework, E2E-PrNet, to train a neural network for pseudorange\ncorrection (PrNet) directly using the final task loss calculated with the\nground truth of GPS receiver states. The gradients of the loss with respect to\nlearnable parameters are backpropagated through a differentiable nonlinear\nleast squares optimizer to PrNet. The feasibility is verified with GPS data\ncollected by Android phones, showing that E2E-PrNet outperforms the\nstate-of-the-art end-to-end GPS localization methods.\n","authors":["Xu Weng","KV Ling","Haochen Liu","Kun Cao"],"pdf_url":"https://arxiv.org/pdf/2401.10685v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10674v1","updated":"2024-01-19T13:13:38Z","published":"2024-01-19T13:13:38Z","title":"Deep Learning-based Embedded Intrusion Detection System for Automotive\n CAN","summary":" Rising complexity of in-vehicle electronics is enabling new capabilities like\nautonomous driving and active safety. However, rising automation also increases\nrisk of security threats which is compounded by lack of in-built security\nmeasures in legacy networks like CAN, allowing attackers to observe, tamper and\nmodify information shared over such broadcast networks. Various intrusion\ndetection approaches have been proposed to detect and tackle such threats, with\nmachine learning models proving highly effective. However, deploying machine\nlearning models will require high processing power through high-end processors\nor GPUs to perform them close to line rate. In this paper, we propose a hybrid\nFPGA-based ECU approach that can transparently integrate IDS functionality\nthrough a dedicated off-the-shelf hardware accelerator that implements a\ndeep-CNN intrusion detection model. 
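Editor's sketch: the sparse double descent entry above examines two-layer networks trained with L1 regularization so that sparsity can be swept against width. A minimal training loop with an explicit L1 penalty is shown below; the widths, learning rate, and penalty strength are arbitrary toy choices.

```python
# Two-layer network trained with an added L1 penalty on the weights.
import torch
import torch.nn as nn

def l1_penalty(model):
    return sum(p.abs().sum() for p in model.parameters())

model = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(128, 20), torch.randn(128, 1)       # toy regression data

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y) + 1e-4 * l1_penalty(model)
    loss.backward()
    opt.step()
```

Sweeping the penalty coefficient (and the hidden width) is what traces out the sparsity axis along which the sparse double descent curve is observed.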
Our results show that the proposed approach\nprovides an average accuracy of over 99% across multiple attack datasets with\n0.64% false detection rates while consuming 94% less energy and achieving 51.8%\nreduction in per-message processing latency when compared to IDS\nimplementations on GPUs.\n","authors":["Shashwat Khandelwal","Eashan Wadhwa","Shreejith Shanker"],"pdf_url":"https://arxiv.org/pdf/2401.10674v1.pdf","comment":"5 pages, 1 figure, 8 tables"},{"id":"http://arxiv.org/abs/2401.09691v2","updated":"2024-01-19T12:43:36Z","published":"2024-01-18T02:44:18Z","title":"Imitation Learning Inputting Image Feature to Each Layer of Neural\n Network","summary":" Imitation learning enables robots to learn and replicate human behavior from\ntraining data. Recent advances in machine learning enable end-to-end learning\napproaches that directly process high-dimensional observation data, such as\nimages. However, these approaches face a critical challenge when processing\ndata from multiple modalities, inadvertently ignoring data with a lower\ncorrelation to the desired output, especially when using short sampling\nperiods. This paper presents a useful method to address this challenge, which\namplifies the influence of data with a relatively low correlation to the output\nby inputting the data into each neural network layer. The proposed approach\neffectively incorporates diverse data sources into the learning process.\nThrough experiments using a simple pick-and-place operation with raw images and\njoint information as input, significant improvements in success rates are\ndemonstrated even when dealing with data from short sampling periods.\n","authors":["Koki Yamane","Sho Sakaino","Toshiaki Tsuji"],"pdf_url":"https://arxiv.org/pdf/2401.09691v2.pdf","comment":"6 pages, 4 figures, Accepted at AMC2024"},{"id":"http://arxiv.org/abs/2312.01185v2","updated":"2024-01-19T12:34:07Z","published":"2023-12-02T17:24:17Z","title":"A ripple in time: a discontinuity in American history","summary":" In this note we use the State of the Union Address (SOTU) dataset from Kaggle\nto make some surprising (and some not so surprising) observations pertaining to\nthe general timeline of American history, and the character and nature of the\naddresses themselves. Our main approach is using vector embeddings, such as\nBERT (DistilBERT) and GPT-2.\n While it is widely believed that BERT (and its variations) is most suitable\nfor NLP classification tasks, we find out that GPT-2 in conjunction with\nnonlinear dimension reduction methods such as UMAP provide better separation\nand stronger clustering. This makes GPT-2 + UMAP an interesting alternative. In\nour case, no model fine-tuning is required, and the pre-trained out-of-the-box\nGPT-2 model is enough.\n We also used a fine-tuned DistilBERT model for classification detecting which\nPresident delivered which address, with very good results (accuracy 93\\% - 95\\%\ndepending on the run). An analogous task was performed to determine the year of\nwriting, and we were able to pin it down to about 4 years (which is a single\npresidential term).\n It is worth noting that SOTU addresses provide relatively small writing\nsamples (with about 8000 words on average, and varying widely from under 2000\nwords to more than 20000), and that the amount of authors is relatively large\n(we used SOTU addresses of 42 US presidents). 
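Editor's sketch: the SOTU entry above reports that out-of-the-box GPT-2 embeddings combined with UMAP separate the addresses well without fine-tuning. The snippet below sketches such a pipeline with the `transformers` and `umap-learn` packages; the mean-pooling choice, toy texts, and UMAP settings are assumptions for illustration.

```python
# Embed documents with pre-trained GPT-2, then project with UMAP.
import numpy as np
import torch
import umap
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()          # mean-pool over tokens

texts = [f"Toy address number {i} about the state of the union." for i in range(20)]
embeddings = np.stack([embed(t) for t in texts])
coords = umap.UMAP(n_components=2, n_neighbors=5, random_state=0).fit_transform(embeddings)
```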
This shows that the techniques\nemployed turn out to be rather efficient, while all the computations described\nin this note can be performed using a single GPU instance of Google Colab.\n The accompanying code is available on GitHub.\n","authors":["Alexander Kolpakov","Igor Rivin"],"pdf_url":"https://arxiv.org/pdf/2312.01185v2.pdf","comment":"7 pages, 8 figures; GitHub repository\n https://github.com/sashakolpakov/ripple_in_time"},{"id":"http://arxiv.org/abs/2312.08010v2","updated":"2024-01-19T12:19:48Z","published":"2023-12-13T09:33:08Z","title":"EZ-CLIP: Efficient Zeroshot Video Action Recognition","summary":" Recent advancements in large-scale pre-training of visual-language models on\npaired image-text data have demonstrated impressive generalization capabilities\nfor zero-shot tasks. Building on this success, efforts have been made to adapt\nthese image-based visual-language models, such as CLIP, for videos extending\ntheir zero-shot capabilities to the video domain. While these adaptations have\nshown promising results, they come at a significant computational cost and\nstruggle with effectively modeling the crucial temporal aspects inherent to the\nvideo domain. In this study, we present EZ-CLIP, a simple and efficient\nadaptation of CLIP that addresses these challenges. EZ-CLIP leverages temporal\nvisual prompting for seamless temporal adaptation, requiring no fundamental\nalterations to the core CLIP architecture while preserving its remarkable\ngeneralization abilities. Moreover, we introduce a novel learning objective\nthat guides the temporal visual prompts to focus on capturing motion, thereby\nenhancing its learning capabilities from video data. We conducted extensive\nexperiments on five different benchmark datasets, thoroughly evaluating EZ-CLIP\nfor zero-shot learning and base-to-novel video action recognition, and also\ndemonstrating its potential for few-shot generalization.Impressively, with a\nmere 5.2 million learnable parameters (as opposed to the 71.1 million in the\nprior best model), EZ-CLIP can be efficiently trained on a single GPU,\noutperforming existing approaches in several evaluations.\n","authors":["Shahzad Ahmad","Sukalpa Chanda","Yogesh S Rawat"],"pdf_url":"https://arxiv.org/pdf/2312.08010v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10657v1","updated":"2024-01-19T12:04:31Z","published":"2024-01-19T12:04:31Z","title":"FIMBA: Evaluating the Robustness of AI in Genomics via Feature\n Importance Adversarial Attacks","summary":" With the steady rise of the use of AI in bio-technical applications and the\nwidespread adoption of genomics sequencing, an increasing amount of AI-based\nalgorithms and tools is entering the research and production stage affecting\ncritical decision-making streams like drug discovery and clinical outcomes.\nThis paper demonstrates the vulnerability of AI models often utilized\ndownstream tasks on recognized public genomics datasets. We undermine model\nrobustness by deploying an attack that focuses on input transformation while\nmimicking the real data and confusing the model decision-making, ultimately\nyielding a pronounced deterioration in model performance. Further, we enhance\nour approach by generating poisoned data using a variational autoencoder-based\nmodel. Our empirical findings unequivocally demonstrate a decline in model\nperformance, underscored by diminished accuracy and an upswing in false\npositives and false negatives. 
Furthermore, we analyze the resulting\nadversarial samples via spectral analysis yielding conclusions for\ncountermeasures against such attacks.\n","authors":["Heorhii Skovorodnikov","Hoda Alkhzaimi"],"pdf_url":"https://arxiv.org/pdf/2401.10657v1.pdf","comment":"15 pages, core code available at:\n https://github.com/HeorhiiS/fimba-attack"},{"id":"http://arxiv.org/abs/2401.10653v1","updated":"2024-01-19T11:59:13Z","published":"2024-01-19T11:59:13Z","title":"Attentive Fusion: A Transformer-based Approach to Multimodal Hate Speech\n Detection","summary":" With the recent surge and exponential growth of social media usage,\nscrutinizing social media content for the presence of any hateful content is of\nutmost importance. Researchers have been diligently working since the past\ndecade on distinguishing between content that promotes hatred and content that\ndoes not. Traditionally, the main focus has been on analyzing textual content.\nHowever, recent research attempts have also commenced into the identification\nof audio-based content. Nevertheless, studies have shown that relying solely on\naudio or text-based content may be ineffective, as recent upsurge indicates\nthat individuals often employ sarcasm in their speech and writing. To overcome\nthese challenges, we present an approach to identify whether a speech promotes\nhate or not utilizing both audio and textual representations. Our methodology\nis based on the Transformer framework that incorporates both audio and text\nsampling, accompanied by our very own layer called \"Attentive Fusion\". The\nresults of our study surpassed previous state-of-the-art techniques, achieving\nan impressive macro F1 score of 0.927 on the Test Set.\n","authors":["Atanu Mandal","Gargi Roy","Amit Barman","Indranil Dutta","Sudip Kumar Naskar"],"pdf_url":"https://arxiv.org/pdf/2401.10653v1.pdf","comment":"Accepted in 20th International Conference on Natural Language\n Processing (ICON)"},{"id":"http://arxiv.org/abs/2401.10652v1","updated":"2024-01-19T11:58:13Z","published":"2024-01-19T11:58:13Z","title":"AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence\n Inference","summary":" Large deep learning models have achieved impressive performance across a\nrange of applications. However, their large memory requirements, including\nparameter memory and activation memory, have become a significant challenge for\ntheir practical serving. While existing methods mainly address parameter\nmemory, the importance of activation memory has been overlooked. Especially for\nlong input sequences, activation memory is expected to experience a significant\nexponential growth as the length of sequences increases. In this approach, we\npropose AutoChunk, an automatic and adaptive compiler system that efficiently\nreduces activation memory for long sequence inference by chunk strategies. The\nproposed system generates chunk plans by optimizing through multiple stages. In\neach stage, the chunk search pass explores all possible chunk candidates and\nthe chunk selection pass identifies the optimal one. At runtime, AutoChunk\nemploys code generation to automatically apply chunk strategies. 
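Editor's sketch: the AutoChunk entry above reduces activation memory by applying chunked execution along the sequence dimension. The snippet below shows the underlying idea by hand for a position-wise MLP; AutoChunk itself searches for and applies such chunk plans automatically via code generation, whereas here the chunk size is a fixed assumption.

```python
# Apply a position-wise MLP to a long sequence in slices so the peak size of
# intermediate activations stays bounded; the result equals mlp(x).
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(1, 16384, 512)            # (batch, very long sequence, hidden)

def chunked_forward(module, x, chunk_size=2048):
    outputs = [module(x[:, i:i + chunk_size]) for i in range(0, x.shape[1], chunk_size)]
    return torch.cat(outputs, dim=1)

y = chunked_forward(mlp, x)
```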
The\nexperiments demonstrate that AutoChunk can reduce over 80\\% of activation\nmemory while maintaining speed loss within 10%, extend max sequence length by\n3.2x to 11.7x, and outperform state-of-the-art methods by a large margin.\n","authors":["Xuanlei Zhao","Shenggan Cheng","Guangyang Lu","Jiarui Fang","Haotian Zhou","Bin Jia","Ziming Liu","Yang You"],"pdf_url":"https://arxiv.org/pdf/2401.10652v1.pdf","comment":"ICLR 2024"},{"id":"http://arxiv.org/abs/2401.10648v1","updated":"2024-01-19T11:48:52Z","published":"2024-01-19T11:48:52Z","title":"Area Modeling using Stay Information for Large-Scale Users and Analysis\n for Influence of COVID-19","summary":" Understanding how people use area in a city can be a valuable information in\na wide range of fields, from marketing to urban planning. Area usage is subject\nto change over time due to various events including seasonal shifts and\npandemics. Before the spread of smartphones, this data had been collected\nthrough questionnaire survey. However, this is not a sustainable approach in\nterms of time to results and cost. There are many existing studies on area\nmodeling, which characterize an area with some kind of information, using Point\nof Interest (POI) or inter-area movement data. However, since POI is data that\nis statically tied to space, and inter-area movement data ignores the behavior\nof people within an area, existing methods are not sufficient in terms of\ncapturing area usage changes. In this paper, we propose a novel area modeling\nmethod named Area2Vec, inspired by Word2Vec, which models areas based on\npeople's location data. This method is based on the discovery that it is\npossible to characterize an area based on its usage by using people's stay\ninformation in the area. And it is a novel method that can reflect the\ndynamically changing people's behavior in an area in the modeling results. We\nvalidated Area2vec by performing a functional classification of areas in a\ndistrict of Japan. The results show that Area2Vec can be usable in general area\nanalysis. We also investigated area usage changes due to COVID-19 in two\ndistricts in Japan. We could find that COVID-19 made people refrain from\nunnecessary going out, such as visiting entertainment areas.\n","authors":["Kazuyuki Shoji","Shunsuke Aoki","Takuro Yonezawa","Nobuo Kawaguchi"],"pdf_url":"https://arxiv.org/pdf/2401.10648v1.pdf","comment":"This paper is an English translation of the paper published in the\n Transactions of the Information Processing Society of Japan\n (http://doi.org/10.20729/00213190)"},{"id":"http://arxiv.org/abs/2401.10646v1","updated":"2024-01-19T11:47:49Z","published":"2024-01-19T11:47:49Z","title":"Empowering HWNs with Efficient Data Labeling: A Clustered Federated\n Semi-Supervised Learning Approach","summary":" Clustered Federated Multitask Learning (CFL) has gained considerable\nattention as an effective strategy for overcoming statistical challenges,\nparticularly when dealing with non independent and identically distributed (non\nIID) data across multiple users. However, much of the existing research on CFL\noperates under the unrealistic premise that devices have access to accurate\nground truth labels. This assumption becomes especially problematic in\nhierarchical wireless networks (HWNs), where edge networks contain a large\namount of unlabeled data, resulting in slower convergence rates and increased\nprocessing times, particularly when dealing with two layers of model\naggregation. 
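Editor's sketch: the Area2Vec entry above models areas from people's stay information with a Word2Vec-inspired objective. The snippet below is a loose, hypothetical analogue using gensim, treating each person's sequence of visited areas as a "sentence"; the area IDs and sequences are invented, and the paper's actual featurization of stay duration and time of day is not reproduced here.

```python
# Word2Vec-style area embeddings learned from toy sequences of visited areas.
from gensim.models import Word2Vec

stay_sequences = [
    ["area_12", "area_03", "area_07", "area_03"],
    ["area_07", "area_07", "area_15", "area_12"],
    ["area_03", "area_15", "area_12", "area_07"],
]
model = Word2Vec(sentences=stay_sequences, vector_size=16, window=2,
                 min_count=1, sg=1, epochs=50, seed=0)
vector_for_area_03 = model.wv["area_03"]   # learned embedding for one area
```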
To address these issues, we introduce a novel framework, Clustered\nFederated Semi-Supervised Learning (CFSL), designed for more realistic HWN\nscenarios. Our approach leverages a best-performing specialized model\nalgorithm, wherein each device is assigned a specialized model that is highly\nadept at generating accurate pseudo-labels for unlabeled data, even when the\ndata stems from diverse environments. We validate the efficacy of CFSL through\nextensive experiments, comparing it with existing methods highlighted in recent\nliterature. Our numerical results demonstrate that CFSL significantly improves\nupon key metrics such as testing accuracy, labeling accuracy, and labeling\nlatency under varying proportions of labeled and unlabeled data while also\naccommodating the non-IID nature of the data and the unique characteristics of\nwireless edge networks.\n","authors":["Moqbel Hamood","Abdullatif Albaseer","Mohamed Abdallah","Ala Al-Fuqaha"],"pdf_url":"https://arxiv.org/pdf/2401.10646v1.pdf","comment":"Accepted for IEEE Wireless Communications and Networking Conference\n (WCNC) 2024"},{"id":"http://arxiv.org/abs/2401.10643v1","updated":"2024-01-19T11:45:10Z","published":"2024-01-19T11:45:10Z","title":"A Comprehensive Survey on Deep-Learning-based Vehicle Re-Identification:\n Models, Data Sets and Challenges","summary":" Vehicle re-identification (ReID) endeavors to associate vehicle images\ncollected from a distributed network of cameras spanning diverse traffic\nenvironments. This task assumes paramount importance within the spectrum of\nvehicle-centric technologies, playing a pivotal role in deploying Intelligent\nTransportation Systems (ITS) and advancing smart city initiatives. Rapid\nadvancements in deep learning have significantly propelled the evolution of\nvehicle ReID technologies in recent years. Consequently, undertaking a\ncomprehensive survey of methodologies centered on deep learning for vehicle\nre-identification has become imperative and inescapable. This paper extensively\nexplores deep learning techniques applied to vehicle ReID. It outlines the\ncategorization of these methods, encompassing supervised and unsupervised\napproaches, delves into existing research within these categories, introduces\ndatasets and evaluation criteria, and delineates forthcoming challenges and\npotential research directions. This comprehensive assessment examines the\nlandscape of deep learning in vehicle ReID and establishes a foundation and\nstarting point for future works. It aims to serve as a complete reference by\nhighlighting challenges and emerging trends, fostering advancements and\napplications in vehicle ReID utilizing deep learning models.\n","authors":["Ali Amiri","Aydin Kaya","Ali Seydi Keceli"],"pdf_url":"https://arxiv.org/pdf/2401.10643v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10637v1","updated":"2024-01-19T11:35:07Z","published":"2024-01-19T11:35:07Z","title":"Towards Universal Unsupervised Anomaly Detection in Medical Imaging","summary":" The increasing complexity of medical imaging data underscores the need for\nadvanced anomaly detection methods to automatically identify diverse\npathologies. Current methods face challenges in capturing the broad spectrum of\nanomalies, often limiting their use to specific lesion types in brain scans. 
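Editor's sketch: the CFSL entry above relies on specialized models that generate pseudo-labels for unlabeled edge data. A generic confidence-thresholded pseudo-labeling step, which is the common core of such schemes, is sketched below; the threshold and the stand-in classifier are assumptions, not the CFSL algorithm itself.

```python
# Confidence-based pseudo-labelling: keep only predictions the model is sure of.
import torch
import torch.nn.functional as F

def pseudo_label(model, unlabeled_x, threshold=0.9):
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_x), dim=1)
    conf, labels = probs.max(dim=1)
    mask = conf >= threshold                 # retain only confident predictions
    return unlabeled_x[mask], labels[mask]

model = torch.nn.Linear(32, 5)               # stand-in for a specialised model
x_unlab = torch.randn(1000, 32)
x_sel, y_pseudo = pseudo_label(model, x_unlab)
```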
To\naddress this challenge, we introduce a novel unsupervised approach, termed\n\\textit{Reversed Auto-Encoders (RA)}, designed to create realistic\npseudo-healthy reconstructions that enable the detection of a wider range of\npathologies. We evaluate the proposed method across various imaging modalities,\nincluding magnetic resonance imaging (MRI) of the brain, pediatric wrist X-ray,\nand chest X-ray, and demonstrate superior performance in detecting anomalies\ncompared to existing state-of-the-art methods. Our unsupervised anomaly\ndetection approach may enhance diagnostic accuracy in medical imaging by\nidentifying a broader range of unknown pathologies. Our code is publicly\navailable at: \\url{https://github.com/ci-ber/RA}.\n","authors":["Cosmin I. Bercea","Benedikt Wiestler","Daniel Rueckert","Julia A. Schnabel"],"pdf_url":"https://arxiv.org/pdf/2401.10637v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10632v1","updated":"2024-01-19T11:20:31Z","published":"2024-01-19T11:20:31Z","title":"Interventional Fairness on Partially Known Causal Graphs: A Constrained\n Optimization Approach","summary":" Fair machine learning aims to prevent discrimination against individuals or\nsub-populations based on sensitive attributes such as gender and race. In\nrecent years, causal inference methods have been increasingly used in fair\nmachine learning to measure unfairness by causal effects. However, current\nmethods assume that the true causal graph is given, which is often not true in\nreal-world applications. To address this limitation, this paper proposes a\nframework for achieving causal fairness based on the notion of interventions\nwhen the true causal graph is partially known. The proposed approach involves\nmodeling fair prediction using a Partially Directed Acyclic Graph (PDAG),\nspecifically, a class of causal DAGs that can be learned from observational\ndata combined with domain knowledge. The PDAG is used to measure causal\nfairness, and a constrained optimization problem is formulated to balance\nbetween fairness and accuracy. Results on both simulated and real-world\ndatasets demonstrate the effectiveness of this method.\n","authors":["Aoqi Zuo","Yiqing Li","Susan Wei","Mingming Gong"],"pdf_url":"https://arxiv.org/pdf/2401.10632v1.pdf","comment":"Accepted to ICLR24"},{"id":"http://arxiv.org/abs/2401.10620v1","updated":"2024-01-19T10:52:57Z","published":"2024-01-19T10:52:57Z","title":"Polytopic Autoencoders with Smooth Clustering for Reduced-order\n Modelling of Flows","summary":" With the advancement of neural networks, there has been a notable increase,\nboth in terms of quantity and variety, in research publications concerning the\napplication of autoencoders to reduced-order models. We propose a polytopic\nautoencoder architecture that includes a lightweight nonlinear encoder, a\nconvex combination decoder, and a smooth clustering network. Supported by\nseveral proofs, the model architecture ensures that all reconstructed states\nlie within a polytope, accompanied by a metric indicating the quality of the\nconstructed polytopes, referred to as polytope error. Additionally, it offers a\nminimal number of convex coordinates for polytopic linear-parameter varying\nsystems while achieving acceptable reconstruction errors compared to proper\northogonal decomposition (POD). To validate our proposed model, we conduct\nsimulations involving two flow scenarios with the incompressible Navier-Stokes\nequation. 
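Editor's sketch: the polytopic autoencoder entry above constrains reconstructions to lie inside a polytope via a convex-combination decoder. The snippet below illustrates that single design choice with softmax "convex coordinates" over learned vertex states; it is a simplified illustration, not the paper's full architecture with its smooth clustering network.

```python
# Convex-combination decoder: every reconstruction is a convex combination of
# learned vertices, so it lies inside the polytope they span.
import torch
import torch.nn as nn

class PolytopicAE(nn.Module):
    def __init__(self, state_dim=128, n_vertices=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_vertices))
        self.vertices = nn.Parameter(torch.randn(n_vertices, state_dim))

    def forward(self, x):
        weights = torch.softmax(self.encoder(x), dim=-1)   # convex coordinates
        return weights @ self.vertices, weights

model = PolytopicAE()
x = torch.randn(4, 128)
x_hat, coords = model(x)        # x_hat is guaranteed to lie in the polytope
```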
Numerical results demonstrate the guaranteed properties of the model,\nlow reconstruction errors compared to POD, and the improvement in error using a\nclustering network.\n","authors":["Jan Heiland","Yongho Kim"],"pdf_url":"https://arxiv.org/pdf/2401.10620v1.pdf","comment":"28 pages, 18 figures"},{"id":"http://arxiv.org/abs/2401.10603v1","updated":"2024-01-19T10:21:27Z","published":"2024-01-19T10:21:27Z","title":"ZnTrack -- Data as Code","summary":" The past decade has seen tremendous breakthroughs in computation and there is\nno indication that this will slow any time soon. Machine learning, large-scale\ncomputing resources, and increased industry focus have resulted in rising\ninvestments in computer-driven solutions for data management, simulations, and\nmodel generation. However, with this growth in computation has come an even\nlarger expansion of data and with it, complexity in data storage, sharing, and\ntracking. In this work, we introduce ZnTrack, a Python-driven data versioning\ntool. ZnTrack builds upon established version control systems to provide a\nuser-friendly and easy-to-use interface for tracking parameters in experiments,\ndesigning workflows, and storing and sharing data. From this ability to reduce\nlarge datasets to a simple Python script emerges the concept of Data as Code, a\ncore component of the work presented here and an undoubtedly important concept\nas the age of computation continues to evolve. ZnTrack offers an open-source,\nFAIR data compatible Python package to enable users to harness these concepts\nof the future.\n","authors":["Fabian Zills","Moritz Schäfer","Samuel Tovey","Johannes Kästner","Christian Holm"],"pdf_url":"https://arxiv.org/pdf/2401.10603v1.pdf","comment":"22 pages, 10 figures, 2MB PDF"},{"id":"http://arxiv.org/abs/2311.11809v2","updated":"2024-01-19T10:10:27Z","published":"2023-11-20T14:42:13Z","title":"LogLead -- Fast and Integrated Log Loader, Enhancer, and Anomaly\n Detector","summary":" This paper introduces LogLead, a tool designed for efficient log analysis\nbenchmarking. LogLead combines three essential steps in log processing:\nloading, enhancing, and anomaly detection. The tool leverages Polars, a\nhigh-speed DataFrame library. We currently have Loaders for eight systems that\nare publicly available (HDFS, Hadoop, BGL, Thunderbird, Spirit, Liberty,\nTrainTicket, and GC Webshop). We have multiple enhancers with three parsers\n(Drain, Spell, LenMa), Bert embedding creation and other log representation\ntechniques like bag-of-words. LogLead integrates to five supervised and four\nunsupervised machine learning algorithms for anomaly detection from SKLearn. By\nintegrating diverse datasets, log representation methods and anomaly detectors,\nLogLead facilitates comprehensive benchmarking in log analysis research. We\nshow that log loading from raw file to dataframe is over 10x faster with\nLogLead compared to past solutions. We demonstrate roughly 2x improvement in\nDrain parsing speed by off-loading log message normalization to LogLead. Our\nbrief benchmarking on HDFS indicates that log representations extending beyond\nthe bag-of-words approach offer limited additional benefits. 
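Editor's sketch: the LogLead entry above benchmarks log representations such as bag-of-words against unsupervised anomaly detectors from scikit-learn. The snippet below strings those two standard pieces together on invented log lines; it is a minimal stand-in for the kind of pipeline LogLead automates, not LogLead's own API.

```python
# Bag-of-words log representation scored by an unsupervised detector.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import IsolationForest

logs = [
    "Received block blk_123 of size 67108864 from 10.0.0.1",
    "Received block blk_124 of size 67108864 from 10.0.0.2",
    "PacketResponder 1 for block blk_123 terminating",
    "Exception in receiveBlock for block blk_999 java.io.IOException",
]
X = CountVectorizer().fit_transform(logs)
scores = IsolationForest(random_state=0).fit(X).decision_function(X)
# Lower scores indicate more anomalous log lines.
```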
Tool URL:\nhttps://github.com/EvoTestOps/LogLead\n","authors":["Mika Mäntylä","Yuqing Wang","Jesse Nyyssölä"],"pdf_url":"https://arxiv.org/pdf/2311.11809v2.pdf","comment":"2024 IEEE International Conference on Software Analysis, Evolution\n and Reengineering (SANER)"},{"id":"http://arxiv.org/abs/2401.10590v1","updated":"2024-01-19T10:02:20Z","published":"2024-01-19T10:02:20Z","title":"Adversarially Robust Signed Graph Contrastive Learning from Balance\n Augmentation","summary":" Signed graphs consist of edges and signs, which can be separated into\nstructural information and balance-related information, respectively. Existing\nsigned graph neural networks (SGNNs) typically rely on balance-related\ninformation to generate embeddings. Nevertheless, the emergence of recent\nadversarial attacks has had a detrimental impact on the balance-related\ninformation. Similar to how structure learning can restore unsigned graphs,\nbalance learning can be applied to signed graphs by improving the balance\ndegree of the poisoned graph. However, this approach encounters the challenge\n\"Irreversibility of Balance-related Information\" - while the balance degree\nimproves, the restored edges may not be the ones originally affected by\nattacks, resulting in poor defense effectiveness. To address this challenge, we\npropose a robust SGNN framework called Balance Augmented-Signed Graph\nContrastive Learning (BA-SGCL), which combines Graph Contrastive Learning\nprinciples with balance augmentation techniques. Experimental results\ndemonstrate that BA-SGCL not only enhances robustness against existing\nadversarial attacks but also achieves superior performance on link sign\nprediction task across various datasets.\n","authors":["Jialong Zhou","Xing Ai","Yuni Lai","Kai Zhou"],"pdf_url":"https://arxiv.org/pdf/2401.10590v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10191v2","updated":"2024-01-19T10:01:36Z","published":"2024-01-18T18:25:29Z","title":"Divide and not forget: Ensemble of selectively trained experts in\n Continual Learning","summary":" Class-incremental learning is becoming more popular as it helps models widen\ntheir applicability while not forgetting what they already know. A trend in\nthis area is to use a mixture-of-expert technique, where different models work\ntogether to solve the task. However, the experts are usually trained all at\nonce using whole task data, which makes them all prone to forgetting and\nincreasing computational burden. To address this limitation, we introduce a\nnovel approach named SEED. SEED selects only one, the most optimal expert for a\nconsidered task, and uses data from this task to fine-tune only this expert.\nFor this purpose, each expert represents each class with a Gaussian\ndistribution, and the optimal expert is selected based on the similarity of\nthose distributions. Consequently, SEED increases diversity and heterogeneity\nwithin the experts while maintaining the high stability of this ensemble\nmethod. 
The extensive experiments demonstrate that SEED achieves\nstate-of-the-art performance in exemplar-free settings across various\nscenarios, showing the potential of expert diversification through data in\ncontinual learning.\n","authors":["Grzegorz Rypeść","Sebastian Cygert","Valeriya Khan","Tomasz Trzciński","Bartosz Zieliński","Bartłomiej Twardowski"],"pdf_url":"https://arxiv.org/pdf/2401.10191v2.pdf","comment":"Accepted for ICLR 2024 (main track), code is available at:\n https://github.com/grypesc/SEED"},{"id":"http://arxiv.org/abs/2401.10586v1","updated":"2024-01-19T09:54:23Z","published":"2024-01-19T09:54:23Z","title":"PuriDefense: Randomized Local Implicit Adversarial Purification for\n Defending Black-box Query-based Attacks","summary":" Black-box query-based attacks constitute significant threats to Machine\nLearning as a Service (MLaaS) systems since they can generate adversarial\nexamples without accessing the target model's architecture and parameters.\nTraditional defense mechanisms, such as adversarial training, gradient masking,\nand input transformations, either impose substantial computational costs or\ncompromise the test accuracy of non-adversarial inputs. To address these\nchallenges, we propose an efficient defense mechanism, PuriDefense, that\nemploys random patch-wise purifications with an ensemble of lightweight\npurification models at a low level of inference cost. These models leverage the\nlocal implicit function and rebuild the natural image manifold. Our theoretical\nanalysis suggests that this approach slows down the convergence of query-based\nattacks by incorporating randomness into purifications. Extensive experiments\non CIFAR-10 and ImageNet validate the effectiveness of our proposed\npurifier-based defense mechanism, demonstrating significant improvements in\nrobustness against query-based attacks.\n","authors":["Ping Guo","Zhiyuan Yang","Xi Lin","Qingchuan Zhao","Qingfu Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.10586v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.12399v3","updated":"2024-01-19T09:49:46Z","published":"2023-11-21T07:22:48Z","title":"A Survey of Graph Meets Large Language Model: Progress and Future\n Directions","summary":" Graph plays a significant role in representing and analyzing complex\nrelationships in real-world applications such as citation networks, social\nnetworks, and biological data. Recently, Large Language Models (LLMs), which\nhave achieved tremendous success in various domains, have also been leveraged\nin graph-related tasks to surpass traditional Graph Neural Networks (GNNs)\nbased methods and yield state-of-the-art performance. In this survey, we first\npresent a comprehensive review and analysis of existing methods that integrate\nLLMs with graphs. First of all, we propose a new taxonomy, which organizes\nexisting methods into three categories based on the role (i.e., enhancer,\npredictor, and alignment component) played by LLMs in graph-related tasks. Then\nwe systematically survey the representative methods along the three categories\nof the taxonomy. Finally, we discuss the remaining limitations of existing\nstudies and highlight promising avenues for future research. 
The relevant\npapers are summarized and will be consistently updated at:\nhttps://github.com/yhLeeee/Awesome-LLMs-in-Graph-tasks.\n","authors":["Yuhan Li","Zhixun Li","Peisong Wang","Jia Li","Xiangguo Sun","Hong Cheng","Jeffrey Xu Yu"],"pdf_url":"https://arxiv.org/pdf/2311.12399v3.pdf","comment":"Work in progress; 13 pages, 5 figures"},{"id":"http://arxiv.org/abs/2401.10566v1","updated":"2024-01-19T09:10:58Z","published":"2024-01-19T09:10:58Z","title":"Robust Multi-Modal Density Estimation","summary":" Development of multi-modal, probabilistic prediction models has lead to a\nneed for comprehensive evaluation metrics. While several metrics can\ncharacterize the accuracy of machine-learned models (e.g., negative\nlog-likelihood, Jensen-Shannon divergence), these metrics typically operate on\nprobability densities. Applying them to purely sample-based prediction models\nthus requires that the underlying density function is estimated. However,\ncommon methods such as kernel density estimation (KDE) have been demonstrated\nto lack robustness, while more complex methods have not been evaluated in\nmulti-modal estimation problems. In this paper, we present ROME (RObust\nMulti-modal density Estimator), a non-parametric approach for density\nestimation which addresses the challenge of estimating multi-modal, non-normal,\nand highly correlated distributions. ROME utilizes clustering to segment a\nmulti-modal set of samples into multiple uni-modal ones and then combines\nsimple KDE estimates obtained for individual clusters in a single multi-modal\nestimate. We compared our approach to state-of-the-art methods for density\nestimation as well as ablations of ROME, showing that it not only outperforms\nestablished methods but is also more robust to a variety of distributions. Our\nresults demonstrate that ROME can overcome the issues of over-fitting and\nover-smoothing exhibited by other estimators, promising a more robust\nevaluation of probabilistic machine learning models.\n","authors":["Anna Mészáros","Julian F. Schumann","Javier Alonso-Mora","Arkady Zgonnikov","Jens Kober"],"pdf_url":"https://arxiv.org/pdf/2401.10566v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10559v1","updated":"2024-01-19T08:50:54Z","published":"2024-01-19T08:50:54Z","title":"OrchMoE: Efficient Multi-Adapter Learning with Task-Skill Synergy","summary":" We advance the field of Parameter-Efficient Fine-Tuning (PEFT) with our novel\nmulti-adapter method, OrchMoE, which capitalizes on modular skill architecture\nfor enhanced forward transfer in neural networks. Unlike prior models that\ndepend on explicit task identification inputs, OrchMoE automatically discerns\ntask categories, streamlining the learning process. This is achieved through an\nintegrated mechanism comprising an Automatic Task Classification module and a\nTask-Skill Allocation module, which collectively deduce task-specific\nclassifications and tailor skill allocation matrices. Our extensive evaluations\non the 'Super Natural Instructions' dataset, featuring 1,600 diverse\ninstructional tasks, indicate that OrchMoE substantially outperforms comparable\nmulti-adapter baselines in terms of both performance and sample utilization\nefficiency, all while operating within the same parameter constraints. 
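Editor's sketch: the ROME entry above estimates multi-modal densities by clustering samples into roughly uni-modal groups and combining per-cluster KDEs. The snippet below is a simplified version of that cluster-then-KDE idea with a size-weighted mixture; the clustering method, cluster count, and toy data are assumptions rather than ROME's exact procedure.

```python
# Cluster samples, fit a Gaussian KDE per cluster, and mix the densities.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import KMeans

samples = np.concatenate([np.random.normal(-4, 1.0, 500),
                          np.random.normal(3, 0.5, 500)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(samples.reshape(-1, 1))

kdes, weights = [], []
for k in np.unique(labels):
    cluster = samples[labels == k]
    kdes.append(gaussian_kde(cluster))
    weights.append(len(cluster) / len(samples))

def density(x):
    # Mixture of per-cluster KDEs, weighted by cluster size.
    return sum(w * kde(x) for w, kde in zip(weights, kdes))

print(density(np.array([0.0, 3.0])))   # mixture density at two query points
```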
These\nfindings suggest that OrchMoE offers a significant leap forward in multi-task\nlearning efficiency.\n","authors":["Haowen Wang","Tao Sun","Kaixiang Ji","Jian Wang","Cong Fan","Jinjie Gu"],"pdf_url":"https://arxiv.org/pdf/2401.10559v1.pdf","comment":"9 pages, 3 figures"},{"id":"http://arxiv.org/abs/2401.10549v1","updated":"2024-01-19T08:26:44Z","published":"2024-01-19T08:26:44Z","title":"Unified View Imputation and Feature Selection Learning for Incomplete\n Multi-view Data","summary":" Although multi-view unsupervised feature selection (MUFS) is an effective\ntechnology for reducing dimensionality in machine learning, existing methods\ncannot directly deal with incomplete multi-view data where some samples are\nmissing in certain views. These methods should first apply predetermined values\nto impute missing data, then perform feature selection on the complete dataset.\nSeparating imputation and feature selection processes fails to capitalize on\nthe potential synergy where local structural information gleaned from feature\nselection could guide the imputation, thereby improving the feature selection\nperformance in turn. Additionally, previous methods only focus on leveraging\nsamples' local structure information, while ignoring the intrinsic locality of\nthe feature space. To tackle these problems, a novel MUFS method, called\nUNified view Imputation and Feature selectIon lEaRning (UNIFIER), is proposed.\nUNIFIER explores the local structure of multi-view data by adaptively learning\nsimilarity-induced graphs from both the sample and feature spaces. Then,\nUNIFIER dynamically recovers the missing views, guided by the sample and\nfeature similarity graphs during the feature selection procedure. Furthermore,\nthe half-quadratic minimization technique is used to automatically weight\ndifferent instances, alleviating the impact of outliers and unreliable restored\ndata. Comprehensive experimental results demonstrate that UNIFIER outperforms\nother state-of-the-art methods.\n","authors":["Yanyong Huang","Zongxin Shen","Tianrui Li","Fengmao Lv"],"pdf_url":"https://arxiv.org/pdf/2401.10549v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10547v1","updated":"2024-01-19T08:13:10Z","published":"2024-01-19T08:13:10Z","title":"PhoGAD: Graph-based Anomaly Behavior Detection with Persistent Homology\n Optimization","summary":" A multitude of toxic online behaviors, ranging from network attacks to\nanonymous traffic and spam, have severely disrupted the smooth operation of\nnetworks. Due to the inherent sender-receiver nature of network behaviors,\ngraph-based frameworks are commonly used for detecting anomalous behaviors.\nHowever, in real-world scenarios, the boundary between normal and anomalous\nbehaviors tends to be ambiguous. The local heterophily of graphs interferes\nwith the detection, and existing methods based on nodes or edges introduce\nunwanted noise into representation results, thereby impacting the effectiveness\nof detection. To address these issues, we propose PhoGAD, a graph-based anomaly\ndetection framework. PhoGAD leverages persistent homology optimization to\nclarify behavioral boundaries. Building upon this, the weights of adjacent\nedges are designed to mitigate the effects of local heterophily. Subsequently,\nto tackle the noise problem, we conduct a formal analysis and propose a\ndisentangled representation-based explicit embedding method, ultimately\nachieving anomaly behavior detection. 
Experiments on intrusion, traffic, and\nspam datasets verify that PhoGAD has surpassed the performance of\nstate-of-the-art (SOTA) frameworks in detection efficacy. Notably, PhoGAD\ndemonstrates robust detection even with diminished anomaly proportions,\nhighlighting its applicability to real-world scenarios. The analysis of\npersistent homology demonstrates its effectiveness in capturing the topological\nstructure formed by normal edge features. Additionally, ablation experiments\nvalidate the effectiveness of the innovative mechanisms integrated within\nPhoGAD.\n","authors":["Ziqi Yuan","Haoyi Zhou","Tianyu Chen","Jianxin Li"],"pdf_url":"https://arxiv.org/pdf/2401.10547v1.pdf","comment":"Accepted by WSDM 2024"},{"id":"http://arxiv.org/abs/2401.08169v2","updated":"2024-01-19T07:48:24Z","published":"2024-01-16T07:18:47Z","title":"Statistical Test for Attention Map in Vision Transformer","summary":" The Vision Transformer (ViT) demonstrates exceptional performance in various\ncomputer vision tasks. Attention is crucial for ViT to capture complex\nwide-ranging relationships among image patches, allowing the model to weigh the\nimportance of image patches and aiding our understanding of the decision-making\nprocess. However, when utilizing the attention of ViT as evidence in\nhigh-stakes decision-making tasks such as medical diagnostics, a challenge\narises due to the potential of attention mechanisms erroneously focusing on\nirrelevant regions. In this study, we propose a statistical test for ViT's\nattentions, enabling us to use the attentions as reliable quantitative evidence\nindicators for ViT's decision-making with a rigorously controlled error rate.\nUsing the framework called selective inference, we quantify the statistical\nsignificance of attentions in the form of p-values, which enables the\ntheoretically grounded quantification of the false positive detection\nprobability of attentions. We demonstrate the validity and the effectiveness of\nthe proposed method through numerical experiments and applications to brain\nimage diagnoses.\n","authors":["Tomohiro Shiraishi","Daiki Miwa","Teruyuki Katsuoka","Vo Nguyen Le Duy","Kouichi Taji","Ichiro Takeuchi"],"pdf_url":"https://arxiv.org/pdf/2401.08169v2.pdf","comment":"42pages, 17figures"},{"id":"http://arxiv.org/abs/2401.10541v1","updated":"2024-01-19T07:44:32Z","published":"2024-01-19T07:44:32Z","title":"I-SplitEE: Image classification in Split Computing DNNs with Early Exits","summary":" The recent advances in Deep Neural Networks (DNNs) stem from their\nexceptional performance across various domains. However, their inherent large\nsize hinders deploying these networks on resource-constrained devices like\nedge, mobile, and IoT platforms. Strategies have emerged, from partial cloud\ncomputation offloading (split computing) to integrating early exits within DNN\nlayers. Our work presents an innovative unified approach merging early exits\nand split computing. We determine the 'splitting layer', the optimal depth in\nthe DNN for edge device computations, and whether to infer on edge device or be\noffloaded to the cloud for inference considering accuracy, computational\nefficiency, and communication costs. Also, Image classification faces diverse\nenvironmental distortions, influenced by factors like time of day, lighting,\nand weather. To adapt to these distortions, we introduce I-SplitEE, an online\nunsupervised algorithm ideal for scenarios lacking ground truths and with\nsequential data. 
Experimental validation using Caltech-256 and Cifar-10\ndatasets subjected to varied distortions showcases I-SplitEE's ability to\nreduce costs by a minimum of 55% with marginal performance degradation of at\nmost 5%.\n","authors":["Divya Jyoti Bajpai","Aastha Jaiswal","Manjesh Kumar Hanawal"],"pdf_url":"https://arxiv.org/pdf/2401.10541v1.pdf","comment":"To appear in proceedings of IEEE International Conference on\n Communications 2024"},{"id":"http://arxiv.org/abs/2401.10535v1","updated":"2024-01-19T07:21:45Z","published":"2024-01-19T07:21:45Z","title":"The \"Colonial Impulse\" of Natural Language Processing: An Audit of\n Bengali Sentiment Analysis Tools and Their Identity-based Biases","summary":" While colonization has sociohistorically impacted people's identities across\nvarious dimensions, those colonial values and biases continue to be perpetuated\nby sociotechnical systems. One category of sociotechnical systems--sentiment\nanalysis tools--can also perpetuate colonial values and bias, yet less\nattention has been paid to how such tools may be complicit in perpetuating\ncoloniality, although they are often used to guide various practices (e.g.,\ncontent moderation). In this paper, we explore potential bias in sentiment\nanalysis tools in the context of Bengali communities that have experienced and\ncontinue to experience the impacts of colonialism. Drawing on identity\ncategories most impacted by colonialism amongst local Bengali communities, we\nfocused our analytic attention on gender, religion, and nationality. We\nconducted an algorithmic audit of all sentiment analysis tools for Bengali,\navailable on the Python package index (PyPI) and GitHub. Despite similar\nsemantic content and structure, our analyses showed that in addition to\ninconsistencies in output from different tools, Bengali sentiment analysis\ntools exhibit bias between different identity categories and respond\ndifferently to different ways of identity expression. Connecting our findings\nwith colonially shaped sociocultural structures of Bengali communities, we\ndiscuss the implications of downstream bias of sentiment analysis tools.\n","authors":["Dipto Das","Shion Guha","Jed Brubaker","Bryan Semaan"],"pdf_url":"https://arxiv.org/pdf/2401.10535v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10529v1","updated":"2024-01-19T07:10:13Z","published":"2024-01-19T07:10:13Z","title":"Mementos: A Comprehensive Benchmark for Multimodal Large Language Model\n Reasoning over Image Sequences","summary":" Multimodal Large Language Models (MLLMs) have demonstrated proficiency in\nhandling a variety of visual-language tasks. However, current MLLM benchmarks\nare predominantly designed to evaluate reasoning based on static information\nabout a single image, and the ability of modern MLLMs to extrapolate from image\nsequences, which is essential for understanding our ever-changing world, has\nbeen less investigated. To address this challenge, this paper introduces\nMementos, a new benchmark designed to assess MLLMs' sequential image reasoning\nabilities. Mementos features 4,761 diverse image sequences with varying\nlengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning\nperformance. 
Through a careful evaluation of nine recent MLLMs on Mementos,\nincluding GPT-4V and Gemini, we find that they struggle to accurately describe\ndynamic information about given image sequences, often leading to\nhallucinations/misrepresentations of objects and their corresponding behaviors.\nOur quantitative analysis and case studies identify three key factors impacting\nMLLMs' sequential image reasoning: the correlation between object and\nbehavioral hallucinations, the influence of cooccurring behaviors, and the\ncompounding impact of behavioral hallucinations. Our dataset is available at\nhttps://github.com/umd-huang-lab/Mementos.\n","authors":["Xiyao Wang","Yuhang Zhou","Xiaoyu Liu","Hongjin Lu","Yuancheng Xu","Feihong He","Jaehong Yoon","Taixi Lu","Gedas Bertasius","Mohit Bansal","Huaxiu Yao","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2401.10529v1.pdf","comment":"27 pages, 23 figures"},{"id":"http://arxiv.org/abs/2401.10522v1","updated":"2024-01-19T06:56:09Z","published":"2024-01-19T06:56:09Z","title":"FARe: Fault-Aware GNN Training on ReRAM-based PIM Accelerators","summary":" Resistive random-access memory (ReRAM)-based processing-in-memory (PIM)\narchitecture is an attractive solution for training Graph Neural Networks\n(GNNs) on edge platforms. However, the immature fabrication process and limited\nwrite endurance of ReRAMs make them prone to hardware faults, thereby limiting\ntheir widespread adoption for GNN training. Further, the existing\nfault-tolerant solutions prove inadequate for effectively training GNNs in the\npresence of faults. In this paper, we propose a fault-aware framework referred\nto as FARe that mitigates the effect of faults during GNN training. FARe\noutperforms existing approaches in terms of both accuracy and timing overhead.\nExperimental results demonstrate that FARe framework can restore GNN test\naccuracy by 47.6% on faulty ReRAM hardware with a ~1% timing overhead compared\nto the fault-free counterpart.\n","authors":["Pratyush Dhingra","Chukwufumnanya Ogbogu","Biresh Kumar Joardar","Janardhan Rao Doppa","Ananth Kalyanaraman","Partha Pratim Pande"],"pdf_url":"https://arxiv.org/pdf/2401.10522v1.pdf","comment":"This paper has been accepted to the conference DATE (Design,\n Automation and Test in Europe) - 2024"},{"id":"http://arxiv.org/abs/2401.10518v1","updated":"2024-01-19T06:26:05Z","published":"2024-01-19T06:26:05Z","title":"Spatial-temporal Forecasting for Regions without Observations","summary":" Spatial-temporal forecasting plays an important role in many real-world\napplications, such as traffic forecasting, air pollutant forecasting,\ncrowd-flow forecasting, and so on. State-of-the-art spatial-temporal\nforecasting models take data-driven approaches and rely heavily on data\navailability. Such models suffer from accuracy issues when data is incomplete,\nwhich is common in reality due to the heavy costs of deploying and maintaining\nsensors for data collection. A few recent studies attempted to address the\nissue of incomplete data. They typically assume some data availability in a\nregion of interest either for a short period or at a few locations. In this\npaper, we further study spatial-temporal forecasting for a region of interest\nwithout any historical observations, to address scenarios such as unbalanced\nregion development, progressive deployment of sensors or lack of open data. We\npropose a model named STSM for the task. 
The model takes a contrastive\nlearning-based approach to learn spatial-temporal patterns from adjacent\nregions that have recorded data. Our key insight is to learn from the locations\nthat resemble those in the region of interest, and we propose a selective\nmasking strategy to enable the learning. As a result, our model outperforms\nadapted state-of-the-art models, reducing errors consistently over both traffic\nand air pollutant forecasting tasks. The source code is available at\nhttps://github.com/suzy0223/STSM.\n","authors":["Xinyu Su","Jianzhong Qi","Egemen Tanin","Yanchuan Chang","Majid Sarvi"],"pdf_url":"https://arxiv.org/pdf/2401.10518v1.pdf","comment":"Accepted by EDBT2024"},{"id":"http://arxiv.org/abs/2401.07494v2","updated":"2024-01-19T06:16:59Z","published":"2024-01-15T06:26:53Z","title":"Input Convex Lipschitz RNN: A Fast and Robust Approach for Engineering\n Tasks","summary":" Computational efficiency and adversarial robustness are critical factors in\nreal-world engineering applications. Yet, conventional neural networks often\nfall short in addressing both simultaneously, or even separately. Drawing\ninsights from natural physical systems and existing literature, it is known\nthat an input convex architecture enhances computational efficiency, while a\nLipschitz-constrained architecture bolsters adversarial robustness. By\nleveraging the strengths of convexity and Lipschitz continuity, we develop a\nnovel network architecture, termed Input Convex Lipschitz Recurrent Neural\nNetworks. This model outperforms existing recurrent units across a spectrum of\nengineering tasks in terms of computational efficiency and adversarial\nrobustness. These tasks encompass a benchmark MNIST image classification,\nreal-world solar irradiance prediction for Solar PV system planning at LHT\nHoldings in Singapore, and real-time Model Predictive Control optimization for\na chemical reactor.\n","authors":["Zihao Wang","P S Pravin","Zhe Wu"],"pdf_url":"https://arxiv.org/pdf/2401.07494v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10516v1","updated":"2024-01-19T06:14:36Z","published":"2024-01-19T06:14:36Z","title":"Episodic Reinforcement Learning with Expanded State-reward Space","summary":" Empowered by deep neural networks, deep reinforcement learning (DRL) has\ndemonstrated tremendous empirical successes in various domains, including\ngames, health care, and autonomous driving. Despite these advancements, DRL is\nstill identified as data-inefficient as effective policies demand vast numbers\nof environmental samples. Recently, episodic control (EC)-based model-free DRL\nmethods enable sample efficiency by recalling past experiences from episodic\nmemory. However, existing EC-based methods suffer from the limitation of\npotential misalignment between the state and reward spaces for neglecting the\nutilization of (past) retrieval states with extensive information, which\nprobably causes inaccurate value estimation and degraded policy performance. To\ntackle this issue, we introduce an efficient EC-based DRL framework with\nexpanded state-reward space, where the expanded states used as the input and\nthe expanded rewards used in the training both contain historical and current\ninformation. To be specific, we reuse the historical states retrieved by EC as\npart of the input states and integrate the retrieved MC-returns into the\nimmediate reward in each interactive transition. 
As a result, our method is\nable to simultaneously achieve the full utilization of retrieval information\nand the better evaluation of state values by a Temporal Difference (TD) loss.\nEmpirical results on challenging Box2d and Mujoco tasks demonstrate the\nsuperiority of our method over a recent sibling method and common baselines.\nFurther, we also verify our method's effectiveness in alleviating Q-value\noverestimation by additional experiments of Q-value comparison.\n","authors":["Dayang Liang","Yaru Zhang","Yunlong Liu"],"pdf_url":"https://arxiv.org/pdf/2401.10516v1.pdf","comment":"Accepted at AAMAS'24"},{"id":"http://arxiv.org/abs/2310.05492v3","updated":"2024-01-19T06:06:46Z","published":"2023-10-09T07:56:16Z","title":"How Abilities in Large Language Models are Affected by Supervised\n Fine-tuning Data Composition","summary":" Large language models (LLMs) with enormous pre-training tokens and parameters\nemerge diverse abilities, including math reasoning, code generation, and\ninstruction following. These abilities are further enhanced by supervised\nfine-tuning (SFT). While the open-source community has explored ad-hoc SFT for\nenhancing individual capabilities, proprietary LLMs exhibit versatility across\nvarious skills. Therefore, understanding the facilitation of multiple abilities\nvia SFT is paramount. In this study, we specifically focus on the interplay\nof data composition between mathematical reasoning, code generation, and\ngeneral human-aligning abilities during SFT. We propose four intriguing\nresearch questions to explore the association between model performance and\nvarious factors including data amount, composition ratio, model size and SFT\nstrategies. Our experiments reveal that distinct capabilities scale differently\nand larger models generally show superior performance with the same amount of data.\nMathematical reasoning and code generation consistently improve with increasing\ndata amount, whereas general abilities plateau after roughly a thousand\nsamples. Moreover, we observe data composition appears to enhance various\nabilities under limited data conditions, yet can lead to performance conflicts\nwhen data is plentiful. Our findings also suggest the amount of composition\ndata influences performance more than the composition ratio. In analysis of SFT\nstrategies, we find that sequentially learning multiple skills risks\ncatastrophic forgetting. Our proposed Dual-stage Mixed Fine-tuning (DMT)\nstrategy offers a promising solution to learn multiple abilities with different\nscaling patterns.\n","authors":["Guanting Dong","Hongyi Yuan","Keming Lu","Chengpeng Li","Mingfeng Xue","Dayiheng Liu","Wei Wang","Zheng Yuan","Chang Zhou","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.05492v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10510v1","updated":"2024-01-19T05:58:30Z","published":"2024-01-19T05:58:30Z","title":"A match made in consistency heaven: when large language models meet\n evolutionary algorithms","summary":" Pre-trained large language models (LLMs) have powerful capabilities for\ngenerating creative natural text. Evolutionary algorithms (EAs) can discover\ndiverse solutions to complex real-world problems. 
Motivated by the common\ncollective and directionality of text sequence generation and evolution, this\npaper illustrates the strong consistency of LLMs and EAs, which includes\nmultiple one-to-one key characteristics: token embedding and genotype-phenotype\nmapping, position encoding and fitness shaping, position embedding and\nselection, attention and crossover, feed-forward neural network and mutation,\nmodel training and parameter update, and multi-task learning and\nmulti-objective optimization. Based on this consistency perspective, existing\ncoupling studies are analyzed, including evolutionary fine-tuning and\nLLM-enhanced EAs. Leveraging these insights, we outline a fundamental roadmap\nfor future research in coupling LLMs and EAs, while highlighting key challenges\nalong the way. The consistency not only reveals the evolution mechanism behind\nLLMs but also facilitates the development of evolved artificial agents that\napproach or surpass biological organisms.\n","authors":["Wang Chao","Jiaxuan Zhao","Licheng Jiao","Lingling Li","Fang Liu","Shuyuan Yang"],"pdf_url":"https://arxiv.org/pdf/2401.10510v1.pdf","comment":"A perspective article under review"},{"id":"http://arxiv.org/abs/2311.07202v3","updated":"2024-01-19T05:54:53Z","published":"2023-11-13T09:41:32Z","title":"Input Convex LSTM: A Convex Approach for Fast Lyapunov-Based Model\n Predictive Control","summary":" Leveraging Input Convex Neural Networks (ICNNs), ICNN-based Model Predictive\nControl (MPC) successfully attains globally optimal solutions by upholding\nconvexity within the MPC framework. However, current ICNN architectures\nencounter the issue of vanishing/exploding gradients, which limits their\nability to serve as deep neural networks for complex tasks. Additionally, the\ncurrent neural network-based MPC, including conventional neural network-based\nMPC and ICNN-based MPC, faces slower convergence speed when compared to MPC\nbased on first-principles models. In this study, we leverage the principles of\nICNNs to propose a novel Input Convex LSTM for Lyapunov-based MPC, with the\nspecific goal of reducing convergence time and mitigating the\nvanishing/exploding gradient problem while ensuring closed-loop stability. From\na simulation study of a nonlinear chemical reactor, we observed a mitigation of\nvanishing/exploding gradient problem and a reduction in convergence time, with\na percentage decrease of 46.7%, 31.3%, and 20.2% compared to baseline plain\nRNN, plain LSTM, and Input Convex Recurrent Neural Networks, respectively.\n","authors":["Zihao Wang","Zhe Wu"],"pdf_url":"https://arxiv.org/pdf/2311.07202v3.pdf","comment":"Submitted to 6th Annual Learning for Dynamics & Control Conference\n (L4DC 2024)"},{"id":"http://arxiv.org/abs/2401.08216v2","updated":"2024-01-19T05:31:07Z","published":"2024-01-16T09:02:34Z","title":"Towards Efficient and Certified Recovery from Poisoning Attacks in\n Federated Learning","summary":" Federated learning (FL) is vulnerable to poisoning attacks, where malicious\nclients manipulate their updates to affect the global model. Although various\nmethods exist for detecting those clients in FL, identifying malicious clients\nrequires sufficient model updates, and hence by the time malicious clients are\ndetected, FL models have been already poisoned. 
Thus, a method is needed to\nrecover an accurate global model after malicious clients are identified.\nCurrent recovery methods rely on (i) all historical information from\nparticipating FL clients and (ii) the initial model unaffected by the malicious\nclients, leading to a high demand for storage and computational resources. In\nthis paper, we show that highly effective recovery can still be achieved based\non (i) selective historical information rather than all historical information\nand (ii) a historical model that has not been significantly affected by\nmalicious clients rather than the initial model. In this scenario, while\nmaintaining comparable recovery performance, we can accelerate the recovery\nspeed and decrease memory consumption. Following this concept, we introduce\nCrab, an efficient and certified recovery method, which relies on selective\ninformation storage and adaptive model rollback. Theoretically, we demonstrate\nthat the difference between the global model recovered by Crab and the one\nrecovered by train-from-scratch can be bounded under certain assumptions. Our\nempirical evaluation, conducted across three datasets over multiple machine\nlearning models, and a variety of untargeted and targeted poisoning attacks\nreveals that Crab is both accurate and efficient, and consistently outperforms\nprevious approaches in terms of both recovery speed and memory consumption.\n","authors":["Yu Jiang","Jiyuan Shen","Ziyao Liu","Chee Wei Tan","Kwok-Yan Lam"],"pdf_url":"https://arxiv.org/pdf/2401.08216v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10495v1","updated":"2024-01-19T05:18:28Z","published":"2024-01-19T05:18:28Z","title":"Causal Layering via Conditional Entropy","summary":" Causal discovery aims to recover information about an unobserved causal graph\nfrom the observable data it generates. Layerings are orderings of the variables\nwhich place causes before effects. In this paper, we provide ways to recover\nlayerings of a graph by accessing the data via a conditional entropy oracle,\nwhen distributions are discrete. Our algorithms work by repeatedly removing\nsources or sinks from the graph. Under appropriate assumptions and\nconditioning, we can separate the sources or sinks from the remainder of the\nnodes by comparing their conditional entropy to the unconditional entropy of\ntheir noise. Our algorithms are provably correct and run in worst-case\nquadratic time. The main assumptions are faithfulness and injective noise, and\neither known noise entropies or weakly monotonically increasing noise entropies\nalong directed paths. In addition, we require one of either a very mild\nextension of faithfulness, or strictly monotonically increasing noise\nentropies, or expanding noise injectivity to include an additional single\nargument in the structural functions.\n","authors":["Itai Feigenbaum","Devansh Arpit","Huan Wang","Shelby Heinecke","Juan Carlos Niebles","Weiran Yao","Caiming Xiong","Silvio Savarese"],"pdf_url":"https://arxiv.org/pdf/2401.10495v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10490v1","updated":"2024-01-19T05:01:43Z","published":"2024-01-19T05:01:43Z","title":"Generalization Error Guaranteed Auto-Encoder-Based Nonlinear Model\n Reduction for Operator Learning","summary":" Many physical processes in science and engineering are naturally represented\nby operators between infinite-dimensional function spaces. 
The problem of\noperator learning, in this context, seeks to extract these physical processes\nfrom empirical data, which is challenging due to the infinite or high\ndimensionality of data. An integral component in addressing this challenge is\nmodel reduction, which reduces both the data dimensionality and problem size.\nIn this paper, we utilize low-dimensional nonlinear structures in model\nreduction by investigating Auto-Encoder-based Neural Network (AENet). AENet\nfirst learns the latent variables of the input data and then learns the\ntransformation from these latent variables to corresponding output data. Our\nnumerical experiments validate the ability of AENet to accurately learn the\nsolution operator of nonlinear partial differential equations. Furthermore, we\nestablish a mathematical and statistical estimation theory that analyzes the\ngeneralization error of AENet. Our theoretical framework shows that the sample\ncomplexity of training AENet is intricately tied to the intrinsic dimension of\nthe modeled process, while also demonstrating the remarkable resilience of\nAENet to noise.\n","authors":["Hao Liu","Biraj Dahal","Rongjie Lai","Wenjing Liao"],"pdf_url":"https://arxiv.org/pdf/2401.10490v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.06120v3","updated":"2024-01-19T04:13:33Z","published":"2023-02-13T06:00:56Z","title":"Knowledge from Large-Scale Protein Contact Prediction Models Can Be\n Transferred to the Data-Scarce RNA Contact Prediction Task","summary":" RNA, whose functionality is largely determined by its structure, plays an\nimportant role in many biological activities. The prediction of pairwise\nstructural proximity between each nucleotide of an RNA sequence can\ncharacterize the structural information of the RNA. Historically, this problem\nhas been tackled by machine learning models using expert-engineered features\nand trained on scarce labeled datasets. Here, we find that the knowledge\nlearned by a protein-coevolution Transformer-based deep neural network can be\ntransferred to the RNA contact prediction task. As protein datasets are orders\nof magnitude larger than those for RNA contact prediction, our findings and the\nsubsequent framework greatly reduce the data scarcity bottleneck. Experiments\nconfirm that RNA contact prediction through transfer learning using a publicly\navailable protein model is greatly improved. Our findings indicate that the\nlearned structural patterns of proteins can be transferred to RNAs, opening up\npotential new avenues for research.\n","authors":["Yiren Jian","Chongyang Gao","Chen Zeng","Yunjie Zhao","Soroush Vosoughi"],"pdf_url":"https://arxiv.org/pdf/2302.06120v3.pdf","comment":"The code is available at\n https://github.com/yiren-jian/CoT-RNA-Transfer"},{"id":"http://arxiv.org/abs/2401.10478v1","updated":"2024-01-19T04:02:49Z","published":"2024-01-19T04:02:49Z","title":"Budgeted Online Model Selection and Fine-Tuning via Federated Learning","summary":" Online model selection involves selecting a model from a set of candidate\nmodels 'on the fly' to perform prediction on a stream of data. The choice of\ncandidate models henceforth has a crucial impact on the performance. Although\nemploying a larger set of candidate models naturally leads to more flexibility\nin model selection, this may be infeasible in cases where prediction tasks are\nperformed on edge devices with limited memory. 
Faced with this challenge, the\npresent paper proposes an online federated model selection framework where a\ngroup of learners (clients) interacts with a server with sufficient memory such\nthat the server stores all candidate models. However, each client only chooses\nto store a subset of models that can be fit into its memory and performs its\nown prediction task using one of the stored models. Furthermore, employing the\nproposed algorithm, clients and the server collaborate to fine-tune models to\nadapt them to a non-stationary environment. Theoretical analysis proves that\nthe proposed algorithm enjoys sub-linear regret with respect to the best model\nin hindsight. Experiments on real datasets demonstrate the effectiveness of the\nproposed algorithm.\n","authors":["Pouya M. Ghari","Yanning Shen"],"pdf_url":"https://arxiv.org/pdf/2401.10478v1.pdf","comment":"Accepted by Transactions on Machine Learning Research (TMLR)"},{"id":"http://arxiv.org/abs/2401.10474v1","updated":"2024-01-19T03:50:19Z","published":"2024-01-19T03:50:19Z","title":"LDReg: Local Dimensionality Regularized Self-Supervised Learning","summary":" Representations learned via self-supervised learning (SSL) can be susceptible\nto dimensional collapse, where the learned representation subspace is of\nextremely low dimensionality and thus fails to represent the full data\ndistribution and modalities. Dimensional collapse also known as the\n\"underfilling\" phenomenon is one of the major causes of degraded performance on\ndownstream tasks. Previous work has investigated the dimensional collapse\nproblem of SSL at a global level. In this paper, we demonstrate that\nrepresentations can span over high dimensional space globally, but collapse\nlocally. To address this, we propose a method called $\\textit{local\ndimensionality regularization (LDReg)}$. Our formulation is based on the\nderivation of the Fisher-Rao metric to compare and optimize local distance\ndistributions at an asymptotically small radius for each data point. By\nincreasing the local intrinsic dimensionality, we demonstrate through a range\nof experiments that LDReg improves the representation quality of SSL. The\nresults also show that LDReg can regularize dimensionality at both local and\nglobal levels.\n","authors":["Hanxun Huang","Ricardo J. G. B. Campello","Sarah Monazam Erfani","Xingjun Ma","Michael E. Houle","James Bailey"],"pdf_url":"https://arxiv.org/pdf/2401.10474v1.pdf","comment":"ICLR 2024"},{"id":"http://arxiv.org/abs/2401.10467v1","updated":"2024-01-19T03:39:43Z","published":"2024-01-19T03:39:43Z","title":"Learning Backdoors for Mixed Integer Programs with Contrastive Learning","summary":" Many real-world problems can be efficiently modeled as Mixed Integer Programs\n(MIPs) and solved with the Branch-and-Bound method. Prior work has shown the\nexistence of MIP backdoors, small sets of variables such that prioritizing\nbranching on them when possible leads to faster running times. However, finding\nhigh-quality backdoors that improve running times remains an open question.\nPrevious work learns to estimate the relative solver speed of randomly sampled\nbackdoors through ranking and then decide whether to use it. In this paper, we\nutilize the Monte-Carlo tree search method to collect backdoors for training,\nrather than relying on random sampling, and adapt a contrastive learning\nframework to train a Graph Attention Network model to predict backdoors. 
Our\nmethod, evaluated on four common MIP problem domains, demonstrates performance\nimprovements over both Gurobi and previous models.\n","authors":["Junyang Cai","Taoan Huang","Bistra Dilkina"],"pdf_url":"https://arxiv.org/pdf/2401.10467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.05225v2","updated":"2024-01-19T03:34:11Z","published":"2023-12-08T18:20:43Z","title":"Neural Spectral Methods: Self-supervised learning in the spectral domain","summary":" We present Neural Spectral Methods, a technique to solve parametric Partial\nDifferential Equations (PDEs), grounded in classical spectral methods. Our\nmethod uses orthogonal bases to learn PDE solutions as mappings between\nspectral coefficients. In contrast to current machine learning approaches which\nenforce PDE constraints by minimizing the numerical quadrature of the residuals\nin the spatiotemporal domain, we leverage Parseval's identity and introduce a\nnew training strategy through a \\textit{spectral loss}. Our spectral loss\nenables more efficient differentiation through the neural network, and\nsubstantially reduces training complexity. At inference time, the computational\ncost of our method remains constant, regardless of the spatiotemporal\nresolution of the domain. Our experimental results demonstrate that our method\nsignificantly outperforms previous machine learning approaches in terms of\nspeed and accuracy by one to two orders of magnitude on multiple different\nproblems. When compared to numerical solvers of the same accuracy, our method\ndemonstrates a $10\\times$ increase in performance speed.\n","authors":["Yiheng Du","Nithin Chalapathi","Aditi Krishnapriyan"],"pdf_url":"https://arxiv.org/pdf/2312.05225v2.pdf","comment":"Accepted to International Conference on Learning Representations\n (ICLR) 2024"},{"id":"http://arxiv.org/abs/2401.10463v1","updated":"2024-01-19T03:24:36Z","published":"2024-01-19T03:24:36Z","title":"Critical Data Size of Language Models from a Grokking Perspective","summary":" We explore the critical data size in language models, a threshold that marks\na fundamental shift from quick memorization to slow generalization. We\nformalize the phase transition under the grokking configuration into the Data\nEfficiency Hypothesis and identify data insufficiency, sufficiency, and surplus\nregimes in language models training dynamics. We develop a grokking\nconfiguration to reproduce grokking on simplistic language models stably by\nrescaling initialization and weight decay. We show that generalization occurs\nonly when language models reach a critical size. We analyze grokking across\nsample-wise and model-wise, verifying the proposed data efficiency hypothesis.\nOur experiments reveal smoother phase transitions occurring at the critical\ndataset size for language datasets. As the model size increases, this critical\npoint also becomes larger, indicating that larger models require more data. 
Our\nresults deepen the understanding of language model training, offering a novel\nperspective on the role of data in the learning mechanism of language models.\n","authors":["Xuekai Zhu","Yao Fu","Bowen Zhou","Zhouhan Lin"],"pdf_url":"https://arxiv.org/pdf/2401.10463v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.11171v4","updated":"2024-01-19T03:23:21Z","published":"2023-04-21T03:26:29Z","title":"Granular-ball computing: an efficient, robust, and interpretable\n adaptive multi-granularity representation and computation method","summary":" Human cognition operates on a \"Global-first\" cognitive mechanism,\nprioritizing information processing based on coarse-grained details. This\nmechanism inherently possesses an adaptive multi-granularity description\ncapacity, resulting in computational traits such as efficiency, robustness, and\ninterpretability. The analysis pattern reliance on the finest granularity and\nsingle-granularity makes most existing computational methods less efficient,\nrobust, and interpretable, which is an important reason for the current lack of\ninterpretability in neural networks. Multi-granularity granular-ball computing\nemploys granular-balls of varying sizes to adaptively represent and envelop the\nsample space, facilitating learning based on these granular-balls. Given that\nthe number of coarse-grained \"granular-balls\" is fewer than sample points,\ngranular-ball computing proves more efficient. Moreover, the inherent\ncoarse-grained nature of granular-balls reduces susceptibility to fine-grained\nsample disturbances, enhancing robustness. The multi-granularity construct of\ngranular-balls generates topological structures and coarse-grained\ndescriptions, naturally augmenting interpretability. Granular-ball computing\nhas successfully ventured into diverse AI domains, fostering the development of\ninnovative theoretical methods, including granular-ball classifiers, clustering\ntechniques, neural networks, rough sets, and evolutionary computing. This has\nnotably ameliorated the efficiency, noise robustness, and interpretability of\ntraditional methods. Overall, granular-ball computing is a rare and innovative\ntheoretical approach in AI that can adaptively and simultaneously enhance\nefficiency, robustness, and interpretability. This article delves into the main\napplication landscapes for granular-ball computing, aiming to equip future\nresearchers with references and insights to refine and expand this promising\ntheory.\n","authors":["Shuyin Xia","Guoyin Wang","Xinbo Gao","Xiaoyu Lian"],"pdf_url":"https://arxiv.org/pdf/2304.11171v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.01521v2","updated":"2024-01-19T03:21:28Z","published":"2022-12-03T03:39:44Z","title":"Distribution Fitting for Combating Mode Collapse in Generative\n Adversarial Networks","summary":" Mode collapse is a significant unsolved issue of generative adversarial\nnetworks. In this work, we examine the causes of mode collapse from a novel\nperspective. Due to the nonuniform sampling in the training process, some\nsub-distributions may be missed when sampling data. As a result, even when the\ngenerated distribution differs from the real one, the GAN objective can still\nachieve the minimum. To address the issue, we propose a global distribution\nfitting (GDF) method with a penalty term to confine the generated data\ndistribution. 
When the generated distribution differs from the real one, GDF\nwill make the objective harder to reach the minimal value, while the original\nglobal minimum is not changed. To deal with the circumstance when the overall\nreal data is unreachable, we also propose a local distribution fitting (LDF)\nmethod. Experiments on several benchmarks demonstrate the effectiveness and\ncompetitive performance of GDF and LDF.\n","authors":["Yanxiang Gong","Zhiwei Xie","Guozhen Duan","Zheng Ma","Mei Xie"],"pdf_url":"https://arxiv.org/pdf/2212.01521v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.18426v3","updated":"2024-01-19T02:56:41Z","published":"2023-11-30T10:24:07Z","title":"Convergence Analysis of Fractional Gradient Descent","summary":" Fractional derivatives are a well-studied generalization of integer order\nderivatives. Naturally, for optimization, it is of interest to understand the\nconvergence properties of gradient descent using fractional derivatives.\nConvergence analysis of fractional gradient descent is currently limited both\nin the methods analyzed and the settings analyzed. This paper aims to fill in\nthese gaps by analyzing variations of fractional gradient descent in smooth and\nconvex, smooth and strongly convex, and smooth and non-convex settings. First,\nnovel bounds will be established bridging fractional and integer derivatives.\nThen, these bounds will be applied to the aforementioned settings to prove\nlinear convergence for smooth and strongly convex functions and $O(1/T)$\nconvergence for smooth and convex functions. Additionally, we prove $O(1/T)$\nconvergence for smooth and non-convex functions using an extended notion of\nsmoothness - H\\\"older smoothness - that is more natural for fractional\nderivatives. Finally, empirical results will be presented on the potential\nspeed up of fractional gradient descent over standard gradient descent as well\nas the challenges of predicting which will be faster in general.\n","authors":["Ashwani Aggarwal"],"pdf_url":"https://arxiv.org/pdf/2311.18426v3.pdf","comment":"24 pages, 4 figures. Added additional results for smooth and convex\n functions"},{"id":"http://arxiv.org/abs/2401.10460v1","updated":"2024-01-19T02:51:00Z","published":"2024-01-19T02:51:00Z","title":"Ultra-lightweight Neural Differential DSP Vocoder For High Quality\n Speech Synthesis","summary":" Neural vocoders model the raw audio waveform and synthesize high-quality\naudio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to\nrun real-time on a low-end device like a smartglass. A pure digital signal\nprocessing (DSP) based vocoder can be implemented via lightweight fast Fourier\ntransforms (FFT), and therefore, is a magnitude faster than any neural vocoder.\nA DSP vocoder often gets a lower audio quality due to consuming over-smoothed\nacoustic model predictions of approximate representations for the vocal tract.\nIn this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder\nthat uses a jointly optimized acoustic model with a DSP vocoder, and learns\nwithout an extracted spectral feature for the vocal tract. The model achieves\naudio quality comparable to neural vocoders with a high average MOS of 4.36\nwhile being efficient as a DSP vocoder. 
Our C++ implementation, without any\nhardware-specific optimization, is at 15 MFLOPS, surpasses MB-MelGAN by 340\ntimes in terms of FLOPS, and achieves a vocoder-only RTF of 0.003 and overall\nRTF of 0.044 while running single-threaded on a 2GHz Intel Xeon CPU.\n","authors":["Prabhav Agrawal","Thilo Koehler","Zhiping Xiu","Prashant Serai","Qing He"],"pdf_url":"https://arxiv.org/pdf/2401.10460v1.pdf","comment":"Accepted for ICASSP 2024"},{"id":"http://arxiv.org/abs/2310.03320v4","updated":"2024-01-19T02:47:51Z","published":"2023-10-05T05:30:42Z","title":"BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs","summary":" Foundation models (FMs) are able to leverage large volumes of unlabeled data\nto demonstrate superior performance across a wide range of tasks. However, FMs\ndeveloped for biomedical domains have largely remained unimodal, i.e.,\nindependently trained and used for tasks on protein sequences alone, small\nmolecule structures alone, or clinical data alone. To overcome this limitation\nof biomedical FMs, we present BioBridge, a novel parameter-efficient learning\nframework, to bridge independently trained unimodal FMs to establish multimodal\nbehavior. BioBridge achieves it by utilizing Knowledge Graphs (KG) to learn\ntransformations between one unimodal FM and another without fine-tuning any\nunderlying unimodal FMs. Our empirical results demonstrate that BioBridge can\nbeat the best baseline KG embedding methods (on average by around 76.3%) in\ncross-modal retrieval tasks. We also identify BioBridge demonstrates\nout-of-domain generalization ability by extrapolating to unseen modalities or\nrelations. Additionally, we also show that BioBridge presents itself as a\ngeneral purpose retriever that can aid biomedical multimodal question answering\nas well as enhance the guided generation of novel drugs.\n","authors":["Zifeng Wang","Zichen Wang","Balasubramaniam Srinivasan","Vassilis N. Ioannidis","Huzefa Rangwala","Rishita Anubhai"],"pdf_url":"https://arxiv.org/pdf/2310.03320v4.pdf","comment":"ICLR 2024"},{"id":"http://arxiv.org/abs/2311.15497v3","updated":"2024-01-19T02:45:44Z","published":"2023-11-27T02:48:06Z","title":"Adaptive Image Registration: A Hybrid Approach Integrating Deep Learning\n and Optimization Functions for Enhanced Precision","summary":" Image registration has traditionally been done using two distinct approaches:\nlearning based methods, relying on robust deep neural networks, and\noptimization-based methods, applying complex mathematical transformations to\nwarp images accordingly. Of course, both paradigms offer advantages and\ndisadvantages, and, in this work, we seek to combine their respective strengths\ninto a single streamlined framework, using the outputs of the learning based\nmethod as initial parameters for optimization while prioritizing computational\npower for the image pairs that offer the greatest loss. 
Our investigations\nshowed improvements of up to 1.6% in test data, while maintaining the same\ninference time, and a substantial 1.0 percentage point performance gain in deformation\nfield smoothness.\n","authors":["Gabriel De Araujo","Shanlin Sun","Xiaohui Xie"],"pdf_url":"https://arxiv.org/pdf/2311.15497v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.08897v2","updated":"2024-01-19T02:39:59Z","published":"2024-01-17T00:46:24Z","title":"CFASL: Composite Factor-Aligned Symmetry Learning for Disentanglement in\n Variational AutoEncoder","summary":" Symmetries of input and latent vectors have provided valuable insights for\ndisentanglement learning in VAEs. However, only a few works were proposed as an\nunsupervised method, and even these works require known factor information in\ntraining data. We propose a novel method, Composite Factor-Aligned Symmetry\nLearning (CFASL), which is integrated into VAEs for learning symmetry-based\ndisentanglement in unsupervised learning without any knowledge of the dataset\nfactor information. CFASL incorporates three novel features for learning\nsymmetry-based disentanglement: 1) Injecting inductive bias to align latent\nvector dimensions to factor-aligned symmetries within an explicit learnable\nsymmetry codebook 2) Learning a composite symmetry to express unknown factors\nchange between two random samples by learning factor-aligned symmetries within\nthe codebook 3) Inducing group equivariant encoder and decoder in training VAEs\nwith the two conditions. In addition, we propose an extended evaluation metric\nfor multi-factor changes in comparison to disentanglement evaluation in VAEs.\nIn quantitative and in-depth qualitative analysis, CFASL demonstrates a\nsignificant improvement of disentanglement in single-factor change, and\nmulti-factor change conditions compared to state-of-the-art methods.\n","authors":["Hee-Jun Jung","Jaehyoung Jeong","Kangil Kim"],"pdf_url":"https://arxiv.org/pdf/2401.08897v2.pdf","comment":"21 pages, 14 figures"},{"id":"http://arxiv.org/abs/2303.03183v2","updated":"2024-01-19T02:31:58Z","published":"2023-03-03T03:17:45Z","title":"Utilizing synthetic training data for the supervised classification of\n rat ultrasonic vocalizations","summary":" Murine rodents generate ultrasonic vocalizations (USVs) with frequencies that\nextend to around 120kHz. These calls are important in social behaviour, and so\ntheir analysis can provide insights into the function of vocal communication,\nand its dysfunction. The manual identification of USVs, and subsequent\nclassification into different subcategories is time consuming. Although machine\nlearning approaches for identification and classification can lead to enormous\nefficiency gains, the time and effort required to generate training data can be\nhigh, and the accuracy of current approaches can be problematic. Here we\ncompare the detection and classification performance of a trained human against\ntwo convolutional neural networks (CNNs), DeepSqueak and VocalMat, on audio\ncontaining rat USVs. Furthermore, we test the effect of inserting synthetic\nUSVs into the training data of the VocalMat CNN as a means of reducing the\nworkload associated with generating a training set. Our results indicate that\nVocalMat outperformed the DeepSqueak CNN on measures of call identification,\nand classification. 
Additionally, we found that the augmentation of training\ndata with synthetic images resulted in a further improvement in accuracy, such\nthat it was sufficiently close to human performance to allow for the use of\nthis software in laboratory conditions.\n","authors":["K. Jack Scott","Lucinda J. Speers","David K. Bilkey"],"pdf_url":"https://arxiv.org/pdf/2303.03183v2.pdf","comment":"25 pages, 5 main figures, 2 tables"},{"id":"http://arxiv.org/abs/2302.13854v2","updated":"2024-01-19T02:19:29Z","published":"2023-02-24T04:28:46Z","title":"A Deep Neural Network Based Reverse Radio Spectrogram Search Algorithm","summary":" Modern radio astronomy instruments generate vast amounts of data, and the\nincreasingly challenging radio frequency interference (RFI) environment\nnecessitates ever-more sophisticated RFI rejection algorithms. The \"needle in a\nhaystack\" nature of searches for transients and technosignatures requires us to\ndevelop methods that can determine whether a signal of interest has unique\nproperties, or is a part of some larger set of pernicious RFI. In the past,\nthis vetting has required onerous manual inspection of very large numbers of\nsignals. In this paper we present a fast and modular deep learning algorithm to\nsearch for lookalike signals of interest in radio spectrogram data. First, we\ntrained a B-Variational Autoencoder on signals returned by an energy detection\nalgorithm. We then adapted a positional embedding layer from classical\nTransformer architecture to embed additional metadata, which we demonstrate\nusing a frequency-based embedding. Next we used the encoder component of the\nB-Variational Autoencoder to extract features from small (~715 Hz, with a\nresolution of 2.79 Hz per frequency bin) windows in the radio spectrogram. We\nused our algorithm to conduct a search for a given query (encoded signal of\ninterest) on a set of signals (encoded features of searched items) to produce\nthe top candidates with similar features. We successfully demonstrate that the\nalgorithm retrieves signals with similar appearance, given only the original\nradio spectrogram data. This algorithm can be used to improve the efficiency of\nvetting signals of interest in technosignature searches, but could also be\napplied to a wider variety of searches for \"lookalike\" signals in large\nastronomical datasets.\n","authors":["Peter Xiangyuan Ma","Steve Croft","Chris Lintott","Andrew P. V. Siemion"],"pdf_url":"https://arxiv.org/pdf/2302.13854v2.pdf","comment":"8 pages, 8 figures"},{"id":"http://arxiv.org/abs/2401.10458v1","updated":"2024-01-19T02:16:30Z","published":"2024-01-19T02:16:30Z","title":"Contrastive Unlearning: A Contrastive Approach to Machine Unlearning","summary":" Machine unlearning aims to eliminate the influence of a subset of training\nsamples (i.e., unlearning samples) from a trained model. Effectively and\nefficiently removing the unlearning samples without negatively impacting the\noverall model performance is still challenging. In this paper, we propose a\ncontrastive unlearning framework, leveraging the concept of representation\nlearning for more effective unlearning. It removes the influence of unlearning\nsamples by contrasting their embeddings against the remaining samples so that\nthey are pushed away from their original classes and pulled toward other\nclasses. By directly optimizing the representation space, it effectively\nremoves the influence of unlearning samples while maintaining the\nrepresentations learned from the remaining samples. 
Experiments on a variety of\ndatasets and models on both class unlearning and sample unlearning showed that\ncontrastive unlearning achieves the best unlearning effects and efficiency with\nthe lowest performance loss compared with the state-of-the-art algorithms.\n","authors":["Hong kyu Lee","Qiuchen Zhang","Carl Yang","Jian Lou","Li Xiong"],"pdf_url":"https://arxiv.org/pdf/2401.10458v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10451v1","updated":"2024-01-19T01:40:58Z","published":"2024-01-19T01:40:58Z","title":"Learning-assisted Stochastic Capacity Expansion Planning: A Bayesian\n Optimization Approach","summary":" Solving large-scale capacity expansion problems (CEPs) is central to\ncost-effective decarbonization of regional-scale energy systems. To ensure the\nintended outcomes of CEPs, modeling uncertainty due to weather-dependent\nvariable renewable energy (VRE) supply and energy demand becomes crucially\nimportant. However, the resulting stochastic optimization models are often less\ncomputationally tractable than their deterministic counterparts. Here, we\npropose a learning-assisted approximate solution method to tractably solve\ntwo-stage stochastic CEPs. Our method identifies low-cost planning decisions by\nconstructing and solving a sequence of tractable temporally aggregated\nsurrogate problems. We adopt a Bayesian optimization approach to searching the\nspace of time series aggregation hyperparameters and compute approximate\nsolutions that minimize costs on a validation set of supply-demand projections.\nImportantly, we evaluate solved planning outcomes on a held-out set of test\nprojections. We apply our approach to generation and transmission expansion\nplanning for a joint power-gas system spanning New England. We show that our\napproach yields an estimated cost savings of up to 3.8% in comparison to\nbenchmark time series aggregation approaches.\n","authors":["Aron Brenner","Rahman Khorramfar","Dharik Mallapragada","Saurabh Amin"],"pdf_url":"https://arxiv.org/pdf/2401.10451v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.05359v3","updated":"2024-01-19T01:30:56Z","published":"2022-05-11T09:11:02Z","title":"Exploring Local Explanations of Nonlinear Models Using Animated Linear\n Projections","summary":" The increased predictive power of machine learning models comes at the cost\nof increased complexity and loss of interpretability, particularly in\ncomparison to parametric statistical models. This trade-off has led to the\nemergence of eXplainable AI (XAI) which provides methods, such as local\nexplanations (LEs) and local variable attributions (LVAs), to shed light on how\na model use predictors to arrive at a prediction. These provide a point\nestimate of the linear variable importance in the vicinity of a single\nobservation. However, LVAs tend not to effectively handle association between\npredictors. To understand how the interaction between predictors affects the\nvariable importance estimate, we can convert LVAs into linear projections and\nuse the radial tour. This is also useful for learning how a model has made a\nmistake, or the effect of outliers, or the clustering of observations. The\napproach is illustrated with examples from categorical (penguin species,\nchocolate types) and quantitative (soccer/football salaries, house prices)\nresponse models. 
The methods are implemented in the R package cheem, available\non CRAN.\n","authors":["Nicholas Spyrison","Dianne Cook","Przemyslaw Biecek"],"pdf_url":"https://arxiv.org/pdf/2205.05359v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10447v1","updated":"2024-01-19T01:30:16Z","published":"2024-01-19T01:30:16Z","title":"Investigating Training Strategies and Model Robustness of Low-Rank\n Adaptation for Language Modeling in Speech Recognition","summary":" The use of low-rank adaptation (LoRA) with frozen pretrained language models\n(PLMs) has become increasingly popular as a mainstream, resource-efficient\nmodeling approach for memory-constrained hardware. In this study, we first\nexplore how to enhance model performance by introducing various LoRA training\nstrategies, achieving relative word error rate reductions of 3.50\\% on the\npublic Librispeech dataset and of 3.67\\% on an internal dataset in the\nmessaging domain. To further characterize the stability of LoRA-based\nsecond-pass speech recognition models, we examine robustness against input\nperturbations. These perturbations are rooted in homophone replacements and a\nnovel metric called N-best Perturbation-based Rescoring Robustness (NPRR), both\ndesigned to measure the relative degradation in the performance of rescoring\nmodels. Our experimental results indicate that while advanced variants of LoRA,\nsuch as dynamic rank-allocated LoRA, lead to performance degradation in\n$1$-best perturbation, they alleviate the degradation in $N$-best perturbation.\nThis finding is in comparison to fully-tuned models and vanilla LoRA tuning\nbaselines, suggesting that a comprehensive selection is needed when using\nLoRA-based adaptation for compute-cost savings and robust language modeling.\n","authors":["Yu Yu","Chao-Han Huck Yang","Tuan Dinh","Sungho Ryu","Jari Kolehmainen","Roger Ren","Denis Filimonov","Prashanth G. Shivakumar","Ankur Gandhe","Ariya Rastow","Jia Xu","Ivan Bulyko","Andreas Stolcke"],"pdf_url":"https://arxiv.org/pdf/2401.10447v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.04336v3","updated":"2024-01-19T01:30:04Z","published":"2024-01-09T03:29:40Z","title":"Deep Efficient Private Neighbor Generation for Subgraph Federated\n Learning","summary":" Behemoth graphs are often fragmented and separately stored by multiple data\nowners as distributed subgraphs in many realistic applications. Without harming\ndata privacy, it is natural to consider the subgraph federated learning\n(subgraph FL) scenario, where each local client holds a subgraph of the entire\nglobal graph, to obtain globally generalized graph mining models. To overcome\nthe unique challenge of incomplete information propagation on local subgraphs\ndue to missing cross-subgraph neighbors, previous works resort to the\naugmentation of local neighborhoods through the joint FL of missing neighbor\ngenerators and GNNs. Yet their technical designs have profound limitations\nregarding the utility, efficiency, and privacy goals of FL. In this work, we\npropose FedDEP to comprehensively tackle these challenges in subgraph FL.\nFedDEP consists of a series of novel technical designs: (1) Deep neighbor\ngeneration through leveraging the GNN embeddings of potential missing\nneighbors; (2) Efficient pseudo-FL for neighbor generation through embedding\nprototyping; and (3) Privacy protection through noise-less\nedge-local-differential-privacy. We analyze the correctness and efficiency of\nFedDEP, and provide theoretical guarantees on its privacy. 
Empirical results on\nfour real-world datasets justify the clear benefits of proposed techniques.\n","authors":["Ke Zhang","Lichao Sun","Bolin Ding","Siu Ming Yiu","Carl Yang"],"pdf_url":"https://arxiv.org/pdf/2401.04336v3.pdf","comment":"Accepted to SDM 2024"},{"id":"http://arxiv.org/abs/2401.10446v1","updated":"2024-01-19T01:29:27Z","published":"2024-01-19T01:29:27Z","title":"Large Language Models are Efficient Learners of Noise-Robust Speech\n Recognition","summary":" Recent advances in large language models (LLMs) have promoted generative\nerror correction (GER) for automatic speech recognition (ASR), which leverages\nthe rich linguistic knowledge and powerful reasoning ability of LLMs to improve\nrecognition results. The latest work proposes a GER benchmark with HyPoradise\ndataset to learn the mapping from ASR N-best hypotheses to ground-truth\ntranscription by efficient LLM finetuning, which shows great effectiveness but\nlacks specificity on noise-robust ASR. In this work, we extend the benchmark to\nnoisy conditions and investigate if we can teach LLMs to perform denoising for\nGER just like what robust ASR do}, where one solution is introducing noise\ninformation as a conditioner into LLM. However, directly incorporating noise\nembeddings from audio encoder could harm the LLM tuning due to cross-modality\ngap. To this end, we propose to extract a language-space noise embedding from\nthe N-best list to represent the noise conditions of source speech, which can\npromote the denoising process in GER. Furthermore, in order to enhance its\nrepresentation ability of audio noise, we design a knowledge distillation (KD)\napproach via mutual information estimation to distill the real noise\ninformation in audio embeddings to our language embedding. Experiments on\nvarious latest LLMs demonstrate our approach achieves a new breakthrough with\nup to 53.9% correction improvement in terms of word error rate while with\nlimited training data. Analysis shows that our language-space noise embedding\ncan well represent the noise conditions of source speech, under which\noff-the-shelf LLMs show strong ability of language-space denoising.\n","authors":["Yuchen Hu","Chen Chen","Chao-Han Huck Yang","Ruizhe Li","Chao Zhang","Pin-Yu Chen","EnSiong Chng"],"pdf_url":"https://arxiv.org/pdf/2401.10446v1.pdf","comment":"Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be\n open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license"},{"id":"http://arxiv.org/abs/2312.10401v2","updated":"2024-01-19T01:25:39Z","published":"2023-12-16T10:05:18Z","title":"Rethinking Dimensional Rationale in Graph Contrastive Learning from\n Causal Perspective","summary":" Graph contrastive learning is a general learning paradigm excelling at\ncapturing invariant information from diverse perturbations in graphs. Recent\nworks focus on exploring the structural rationale from graphs, thereby\nincreasing the discriminability of the invariant information. However, such\nmethods may incur in the mis-learning of graph models towards the\ninterpretability of graphs, and thus the learned noisy and task-agnostic\ninformation interferes with the prediction of graphs. To this end, with the\npurpose of exploring the intrinsic rationale of graphs, we accordingly propose\nto capture the dimensional rationale from graphs, which has not received\nsufficient attention in the literature. The conducted exploratory experiments\nattest to the feasibility of the aforementioned roadmap. 
To elucidate the\ninnate mechanism behind the performance improvement arising from the\ndimensional rationale, we rethink the dimensional rationale in graph\ncontrastive learning from a causal perspective and further formalize the\ncausality among the variables in the pre-training stage to build the\ncorresponding structural causal model. On the basis of the understanding of the\nstructural causal model, we propose the dimensional rationale-aware graph\ncontrastive learning approach, which introduces a learnable dimensional\nrationale acquiring network and a redundancy reduction constraint. The\nlearnable dimensional rationale acquiring network is updated by leveraging a\nbi-level meta-learning technique, and the redundancy reduction constraint\ndisentangles the redundant features through a decorrelation process during\nlearning. Empirically, compared with state-of-the-art methods, our method can\nyield significant performance boosts on various benchmarks with respect to\ndiscriminability and transferability. The code implementation of our method is\navailable at https://github.com/ByronJi/DRGCL.\n","authors":["Qirui Ji","Jiangmeng Li","Jie Hu","Rui Wang","Changwen Zheng","Fanjiang Xu"],"pdf_url":"https://arxiv.org/pdf/2312.10401v2.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2401.10442v1","updated":"2024-01-19T01:11:44Z","published":"2024-01-19T01:11:44Z","title":"Path Choice Matters for Clear Attribution in Path Methods","summary":" Rigorousness and clarity are both essential for interpretations of DNNs to\nengender human trust. Path methods are commonly employed to generate rigorous\nattributions that satisfy three axioms. However, the meaning of attributions\nremains ambiguous due to distinct path choices. To address the ambiguity, we\nintroduce \\textbf{Concentration Principle}, which centrally allocates high\nattributions to indispensable features, thereby endowing aesthetic and\nsparsity. We then present \\textbf{SAMP}, a model-agnostic interpreter, which\nefficiently searches the near-optimal path from a pre-defined set of\nmanipulation paths. Moreover, we propose the infinitesimal constraint (IC) and\nmomentum strategy (MS) to improve the rigorousness and optimality.\nVisualizations show that SAMP can precisely reveal DNNs by pinpointing salient\nimage pixels. We also perform quantitative experiments and observe that our\nmethod significantly outperforms the counterparts. Code:\nhttps://github.com/zbr17/SAMP.\n","authors":["Borui Zhang","Wenzhao Zheng","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2401.10442v1.pdf","comment":"ICLR 2024 accepted"},{"id":"http://arxiv.org/abs/2210.02672v3","updated":"2024-01-19T00:57:05Z","published":"2022-10-06T04:30:59Z","title":"A Novel Maximum-Entropy-Driven Technique for Low-Rank Orthogonal\n Nonnegative Matrix Factorization with $\\ell_0$-Norm sparsity Constraint","summary":" In data-driven control and machine learning, a common requirement involves\nbreaking down large matrices into smaller, low-rank factors that possess\nspecific levels of sparsity. This paper introduces an innovative solution to\nthe orthogonal nonnegative matrix factorization (ONMF) problem. The objective\nis to approximate input data by using two low-rank nonnegative matrices,\nadhering to both orthogonality and $\\ell_0$-norm sparsity constraints. the\nproposed maximum-entropy-principle based framework ensures orthogonality and\nsparsity of features or the mixing matrix, while maintaining nonnegativity in\nboth. 
Additionally, the methodology offers a quantitative determination of the\n``true'' number of underlying features, a crucial hyperparameter for ONMF.\nExperimental evaluation on synthetic and a standard datasets highlights the\nmethod's superiority in terms of sparsity, orthogonality, and computational\nspeed compared to existing approaches. Notably, the proposed method achieves\ncomparable or improved reconstruction errors in line with the literature.\n","authors":["Salar Basiri","Srinivasa Salapaka"],"pdf_url":"https://arxiv.org/pdf/2210.02672v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.00110v3","updated":"2024-01-19T00:35:35Z","published":"2023-12-30T01:24:25Z","title":"Diffusion Model with Perceptual Loss","summary":" Diffusion models trained with mean squared error loss tend to generate\nunrealistic samples. Current state-of-the-art models rely on classifier-free\nguidance to improve sample quality, yet its surprising effectiveness is not\nfully understood. In this paper, we show that the effectiveness of\nclassifier-free guidance partly originates from it being a form of implicit\nperceptual guidance. As a result, we can directly incorporate perceptual loss\nin diffusion training to improve sample quality. Since the score matching\nobjective used in diffusion training strongly resembles the denoising\nautoencoder objective used in unsupervised training of perceptual networks, the\ndiffusion model itself is a perceptual network and can be used to generate\nmeaningful perceptual loss. We propose a novel self-perceptual objective that\nresults in diffusion models capable of generating more realistic samples. For\nconditional generation, our method only improves sample quality without\nentanglement with the conditional input and therefore does not sacrifice sample\ndiversity. Our method can also improve sample quality for unconditional\ngeneration, which was not possible with classifier-free guidance before.\n","authors":["Shanchuan Lin","Xiao Yang"],"pdf_url":"https://arxiv.org/pdf/2401.00110v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.07988v3","updated":"2024-01-19T00:28:45Z","published":"2023-09-14T19:01:08Z","title":"Folding Attention: Memory and Power Optimization for On-Device\n Transformer-based Streaming Speech Recognition","summary":" Transformer-based models excel in speech recognition. Existing efforts to\noptimize Transformer inference, typically for long-context applications, center\non simplifying attention score calculations. However, streaming speech\nrecognition models usually process a limited number of tokens each time, making\nattention score calculation less of a bottleneck. Instead, the bottleneck lies\nin the linear projection layers of multi-head attention and feedforward\nnetworks, constituting a substantial portion of the model size and contributing\nsignificantly to computation, memory, and power usage.\n To address this bottleneck, we propose folding attention, a technique\ntargeting these linear layers, significantly reducing model size and improving\nmemory and power efficiency. Experiments on on-device Transformer-based\nstreaming speech recognition models show that folding attention reduces model\nsize (and corresponding memory consumption) by up to 24% and power consumption\nby up to 23%, all without compromising model accuracy or computation overhead.\n","authors":["Yang Li","Liangzhen Lai","Yuan Shangguan","Forrest N. 
Iandola","Zhaoheng Ni","Ernie Chang","Yangyang Shi","Vikas Chandra"],"pdf_url":"https://arxiv.org/pdf/2309.07988v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10432v1","updated":"2024-01-19T00:27:34Z","published":"2024-01-19T00:27:34Z","title":"A2Q+: Improving Accumulator-Aware Weight Quantization","summary":" Quantization techniques commonly reduce the inference costs of neural\nnetworks by restricting the precision of weights and activations. Recent\nstudies show that also reducing the precision of the accumulator can further\nimprove hardware efficiency at the risk of numerical overflow, which introduces\narithmetic errors that can degrade model accuracy. To avoid numerical overflow\nwhile maintaining accuracy, recent work proposed accumulator-aware quantization\n(A2Q), a quantization-aware training method that constrains model weights\nduring training to safely use a target accumulator bit width during inference.\nAlthough this shows promise, we demonstrate that A2Q relies on an overly\nrestrictive constraint and a sub-optimal weight initialization strategy that\neach introduce superfluous quantization error. To address these shortcomings,\nwe introduce: (1) an improved bound that alleviates accumulator constraints\nwithout compromising overflow avoidance; and (2) a new strategy for\ninitializing quantized weights from pre-trained floating-point checkpoints. We\ncombine these contributions with weight normalization to introduce A2Q+. We\nsupport our analysis with experiments that show A2Q+ significantly improves the\ntrade-off between accumulator bit width and model accuracy and characterize new\ntrade-offs that arise as a consequence of accumulator constraints.\n","authors":["Ian Colbert","Alessandro Pappalardo","Jakoba Petri-Koenig","Yaman Umuroglu"],"pdf_url":"https://arxiv.org/pdf/2401.10432v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.01409v4","updated":"2024-01-19T00:23:28Z","published":"2022-06-03T06:34:09Z","title":"Hybrid Parameter Search and Dynamic Model Selection for Mixed-Variable\n Bayesian Optimization","summary":" This paper presents a new type of hybrid model for Bayesian optimization (BO)\nadept at managing mixed variables, encompassing both quantitative (continuous\nand integer) and qualitative (categorical) types. Our proposed new hybrid\nmodels (named hybridM) merge the Monte Carlo Tree Search structure (MCTS) for\ncategorical variables with Gaussian Processes (GP) for continuous ones. hybridM\nleverages the upper confidence bound tree search (UCTS) for MCTS strategy,\nshowcasing the tree architecture's integration into Bayesian optimization. Our\ninnovations, including dynamic online kernel selection in the surrogate\nmodeling phase and a unique UCTS search strategy, position our hybrid models as\nan advancement in mixed-variable surrogate models. Numerical experiments\nunderscore the superiority of hybrid models, highlighting their potential in\nBayesian optimization.\n","authors":["Hengrui Luo","Younghyun Cho","James W. Demmel","Xiaoye S. Li","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2206.01409v4.pdf","comment":"33 pages, 8 Figures"},{"id":"http://arxiv.org/abs/2305.14402v3","updated":"2024-01-19T00:16:49Z","published":"2023-05-23T10:16:08Z","title":"Enhancing Speech Emotion Recognition Through Differentiable Architecture\n Search","summary":" Speech Emotion Recognition (SER) is a critical enabler of emotion-aware\ncommunication in human-computer interactions. 
Recent advancements in Deep\nLearning (DL) have substantially enhanced the performance of SER models through\nincreased model complexity. However, designing optimal DL architectures\nrequires prior experience and experimental evaluations. Encouragingly, Neural\nArchitecture Search (NAS) offers a promising avenue to determine an optimal DL\nmodel automatically. In particular, Differentiable Architecture Search (DARTS)\nis an efficient method of using NAS to search for optimised models. This paper\nproposes a DARTS-optimised joint CNN and LSTM architecture, to improve SER\nperformance, where the literature informs the selection of CNN and LSTM\ncoupling to offer improved performance. While DARTS has previously been applied\nto CNN and LSTM combinations, our approach introduces a novel mechanism,\nparticularly in selecting CNN operations using DARTS. In contrast to previous\nstudies, we refrain from imposing constraints on the order of the layers for\nthe CNN within the DARTS cell; instead, we allow DARTS to determine the optimal\nlayer order autonomously. Experimenting with the IEMOCAP and MSP-IMPROV\ndatasets, we demonstrate that our proposed methodology achieves significantly\nhigher SER accuracy than hand-engineering the CNN-LSTM configuration. It also\noutperforms the best-reported SER results achieved using DARTS on CNN-LSTM.\n","authors":["Thejan Rajapakshe","Rajib Rana","Sara Khalifa","Berrak Sisman","Björn Schuller"],"pdf_url":"https://arxiv.org/pdf/2305.14402v3.pdf","comment":"5 pages, 4 figures"}],"Multimedia":[{"id":"http://arxiv.org/abs/2401.10608v1","updated":"2024-01-19T10:37:27Z","published":"2024-01-19T10:37:27Z","title":"M2ORT: Many-To-One Regression Transformer for Spatial Transcriptomics\n Prediction from Histopathology Images","summary":" The advancement of Spatial Transcriptomics (ST) has facilitated the\nspatially-aware profiling of gene expressions based on histopathology images.\nAlthough ST data offers valuable insights into the micro-environment of tumors,\nits acquisition cost remains expensive. Therefore, directly predicting the ST\nexpressions from digital pathology images is desired. Current methods usually\nadopt existing regression backbones for this task, which ignore the inherent\nmulti-scale hierarchical data structure of digital pathology images. To address\nthis limit, we propose M2ORT, a many-to-one regression Transformer that can\naccommodate the hierarchical structure of the pathology images through a\ndecoupled multi-scale feature extractor. Different from traditional models that\nare trained with one-to-one image-label pairs, M2ORT accepts multiple pathology\nimages of different magnifications at a time to jointly predict the gene\nexpressions at their corresponding common ST spot, aiming at learning a\nmany-to-one relationship through training. We have tested M2ORT on three public\nST datasets and the experimental results show that M2ORT can achieve\nstate-of-the-art performance with fewer parameters and floating-point\noperations (FLOPs). 
The code is available at:\nhttps://github.com/Dootmaan/M2ORT/.\n","authors":["Hongyi Wang","Xiuju Du","Jing Liu","Shuyi Ouyang","Yen-Wei Chen","Lanfen Lin"],"pdf_url":"https://arxiv.org/pdf/2401.10608v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10475v1","updated":"2024-01-19T03:54:58Z","published":"2024-01-19T03:54:58Z","title":"CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short\n Video Search Scenarios","summary":" Vision-Language Models pre-trained on large-scale image-text datasets have\nshown superior performance in downstream tasks such as image retrieval. Most of\nthe images for pre-training are presented in the form of open domain\ncommon-sense visual elements. Differently, video covers in short video search\nscenarios are presented as user-originated contents that provide important\nvisual summaries of videos. In addition, a portion of the video covers come\nwith manually designed cover texts that provide semantic complements. In order\nto fill in the gaps in short video cover data, we establish the first\nlarge-scale cover-text benchmark for Chinese short video search scenarios.\nSpecifically, we release two large-scale datasets CBVS-5M/10M to provide short\nvideo covers, and the manual fine-labeling dataset CBVS-20K to provide real\nuser queries, which serves as an image-text benchmark test in the Chinese short\nvideo search field. To integrate the semantics of cover text in the case of\nmodality missing, we propose UniCLIP where cover texts play a guiding role\nduring training, however are not relied upon by inference. Extensive evaluation\non CBVS-20K demonstrates the excellent performance of our proposal. UniCLIP has\nbeen deployed to Tencent's online video search systems with hundreds of\nmillions of visits and achieved significant gains. The complete dataset, code\nand checkpoints will be available upon release.\n","authors":["Xiangshuo Qiao","Xianxin Li","Xiaozhe Qu","Jie Zhang","Yang Liu","Yu Luo","Cihang Jin","Jin Ma"],"pdf_url":"https://arxiv.org/pdf/2401.10475v1.pdf","comment":null}]},"2024-01-22T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2401.06766v2","updated":"2024-01-22T18:55:35Z","published":"2024-01-12T18:58:26Z","title":"Mind Your Format: Towards Consistent Evaluation of In-Context Learning\n Improvements","summary":" Large language models demonstrate a remarkable capability for learning to\nsolve new tasks from a few examples. The prompt template, or the way the input\nexamples are formatted to obtain the prompt, is an important yet often\noverlooked aspect of in-context learning. In this work, we conduct a\ncomprehensive study of the template format's influence on the in-context\nlearning performance. We evaluate the impact of the prompt template across\nmodels (from 770M to 70B parameters) and 4 standard classification datasets. We\nshow that a poor choice of the template can reduce the performance of the\nstrongest models and inference methods to a random guess level. More\nimportantly, the best templates do not transfer between different setups and\neven between models of the same family. Our findings show that the currently\nprevalent approach to evaluation, which ignores template selection, may give\nmisleading results due to different templates in different works. As a first\nstep towards mitigating this issue, we propose Template Ensembles that\naggregate model predictions across several templates. 
This simple test-time\naugmentation boosts average performance while being robust to the choice of\nrandom set of templates.\n","authors":["Anton Voronov","Lena Wolf","Max Ryabinin"],"pdf_url":"https://arxiv.org/pdf/2401.06766v2.pdf","comment":"21 pages, 10 figures. Code:\n https://github.com/yandex-research/mind-your-format"},{"id":"http://arxiv.org/abs/2401.12208v1","updated":"2024-01-22T18:51:07Z","published":"2024-01-22T18:51:07Z","title":"CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation","summary":" Chest X-rays (CXRs) are the most frequently performed imaging test in\nclinical practice. Recent advances in the development of vision-language\nfoundation models (FMs) give rise to the possibility of performing automated\nCXR interpretation, which can assist physicians with clinical decision-making\nand improve patient outcomes. However, developing FMs that can accurately\ninterpret CXRs is challenging due to the (1) limited availability of\nlarge-scale vision-language datasets in the medical image domain, (2) lack of\nvision and language encoders that can capture the complexities of medical data,\nand (3) absence of evaluation frameworks for benchmarking the abilities of FMs\non CXR interpretation. In this work, we address these challenges by first\nintroducing \\emph{CheXinstruct} - a large-scale instruction-tuning dataset\ncurated from 28 publicly-available datasets. We then present \\emph{CheXagent} -\nan instruction-tuned FM capable of analyzing and summarizing CXRs. To build\nCheXagent, we design a clinical large language model (LLM) for parsing\nradiology reports, a vision encoder for representing CXR images, and a network\nto bridge the vision and language modalities. Finally, we introduce\n\\emph{CheXbench} - a novel benchmark designed to systematically evaluate FMs\nacross 8 clinically-relevant CXR interpretation tasks. Extensive quantitative\nevaluations and qualitative reviews with five expert radiologists demonstrate\nthat CheXagent outperforms previously-developed general- and medical-domain FMs\non CheXbench tasks. Furthermore, in an effort to improve model transparency, we\nperform a fairness evaluation across factors of sex, race and age to highlight\npotential performance disparities. Our project is at\n\\url{https://stanford-aimi.github.io/chexagent.html}.\n","authors":["Zhihong Chen","Maya Varma","Jean-Benoit Delbrouck","Magdalini Paschali","Louis Blankemeier","Dave Van Veen","Jeya Maria Jose Valanarasu","Alaa Youssef","Joseph Paul Cohen","Eduardo Pontes Reis","Emily B. Tsai","Andrew Johnston","Cameron Olsen","Tanishq Mathew Abraham","Sergios Gatidis","Akshay S. Chaudhari","Curtis Langlotz"],"pdf_url":"https://arxiv.org/pdf/2401.12208v1.pdf","comment":"24 pages, 8 figures"},{"id":"http://arxiv.org/abs/2401.12200v1","updated":"2024-01-22T18:39:40Z","published":"2024-01-22T18:39:40Z","title":"APT: Adaptive Pruning and Tuning Pretrained Language Models for\n Efficient Training and Inference","summary":" Fine-tuning and inference with large Language Models (LM) are generally known\nto be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces\ntraining memory by updating a small number of LM parameters but does not\nimprove inference efficiency. Structured pruning improves LM inference\nefficiency by removing consistent parameter blocks, yet often increases\ntraining memory and time. To improve both training and inference efficiency, we\nintroduce APT that adaptively prunes and tunes parameters for the LMs. 
At the\nearly stage of fine-tuning, APT dynamically adds salient tuning parameters for\nfast and accurate convergence while discarding unimportant parameters for\nefficiency. Compared to baselines, our experiments show that APT maintains up\nto 98% task performance when pruning RoBERTa and T5 models with 40% parameters\nleft while keeping 86.4% LLaMA models' performance with 70% parameters\nremained. Furthermore, APT speeds up LMs fine-tuning by up to 8x and reduces\nlarge LMs memory training footprint by up to 70%.\n","authors":["Bowen Zhao","Hannaneh Hajishirzi","Qingqing Cao"],"pdf_url":"https://arxiv.org/pdf/2401.12200v1.pdf","comment":"19 pages, 6 figures"},{"id":"http://arxiv.org/abs/2401.12192v1","updated":"2024-01-22T18:34:42Z","published":"2024-01-22T18:34:42Z","title":"Text Embedding Inversion Attacks on Multilingual Language Models","summary":" Representing textual information as real-numbered embeddings has become the\nnorm in NLP. Moreover, with the rise of public interest in large language\nmodels (LLMs), Embeddings as a Service (EaaS) has rapidly gained traction as a\nbusiness model. This is not without outstanding security risks, as previous\nresearch has demonstrated that sensitive data can be reconstructed from\nembeddings, even without knowledge of the underlying model that generated them.\nHowever, such work is limited by its sole focus on English, leaving all other\nlanguages vulnerable to attacks by malicious actors. %As many international and\nmultilingual companies leverage EaaS, there is an urgent need for research into\nmultilingual LLM security. To this end, this work investigates LLM security\nfrom the perspective of multilingual embedding inversion. Concretely, we define\nthe problem of black-box multilingual and cross-lingual inversion attacks, with\nspecial attention to a cross-domain scenario. Our findings reveal that\nmultilingual models are potentially more vulnerable to inversion attacks than\ntheir monolingual counterparts. This stems from the reduced data requirements\nfor achieving comparable inversion performance in settings where the underlying\nlanguage is not known a-priori. To our knowledge, this work is the first to\ndelve into multilinguality within the context of inversion attacks, and our\nfindings highlight the need for further investigation and enhanced defenses in\nthe area of NLP Security.\n","authors":["Yiyi Chen","Heather Lent","Johannes Bjerva"],"pdf_url":"https://arxiv.org/pdf/2401.12192v1.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2401.12187v1","updated":"2024-01-22T18:27:08Z","published":"2024-01-22T18:27:08Z","title":"WARM: On the Benefits of Weight Averaged Reward Models","summary":" Aligning large language models (LLMs) with human preferences through\nreinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit\nfailures in the reward model (RM) to achieve seemingly high rewards without\nmeeting the underlying objectives. We identify two primary challenges when\ndesigning RMs to mitigate reward hacking: distribution shifts during the RL\nprocess and inconsistencies in human preferences. As a solution, we propose\nWeight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then\naveraging them in the weight space. This strategy follows the observation that\nfine-tuned weights remain linearly mode connected when sharing the same\npre-training. 
By averaging weights, WARM improves efficiency compared to the\ntraditional ensembling of predictions, while improving reliability under\ndistribution shifts and robustness to preference inconsistencies. Our\nexperiments on summarization tasks, using best-of-N and RL methods, shows that\nWARM improves the overall quality and alignment of LLM predictions; for\nexample, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy\nRL fine-tuned with a single RM.\n","authors":["Alexandre Ramé","Nino Vieillard","Léonard Hussenot","Robert Dadashi","Geoffrey Cideron","Olivier Bachem","Johan Ferret"],"pdf_url":"https://arxiv.org/pdf/2401.12187v1.pdf","comment":"14 pages, 9 figures"},{"id":"http://arxiv.org/abs/2401.12181v1","updated":"2024-01-22T18:11:01Z","published":"2024-01-22T18:11:01Z","title":"Universal Neurons in GPT2 Language Models","summary":" A basic question within the emerging field of mechanistic interpretability is\nthe degree to which neural networks learn the same underlying mechanisms. In\nother words, are neural mechanisms universal across different models? In this\nwork, we study the universality of individual neurons across GPT2 models\ntrained from different initial random seeds, motivated by the hypothesis that\nuniversal neurons are likely to be interpretable. In particular, we compute\npairwise correlations of neuron activations over 100 million tokens for every\nneuron pair across five different seeds and find that 1-5\\% of neurons are\nuniversal, that is, pairs of neurons which consistently activate on the same\ninputs. We then study these universal neurons in detail, finding that they\nusually have clear interpretations and taxonomize them into a small number of\nneuron families. We conclude by studying patterns in neuron weights to\nestablish several universal functional roles of neurons in simple circuits:\ndeactivating attention heads, changing the entropy of the next token\ndistribution, and predicting the next token to (not) be within a particular\nset.\n","authors":["Wes Gurnee","Theo Horsley","Zifan Carl Guo","Tara Rezaei Kheirkhah","Qinyi Sun","Will Hathaway","Neel Nanda","Dimitris Bertsimas"],"pdf_url":"https://arxiv.org/pdf/2401.12181v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12178v1","updated":"2024-01-22T18:09:52Z","published":"2024-01-22T18:09:52Z","title":"In-Context Learning for Extreme Multi-Label Classification","summary":" Multi-label classification problems with thousands of classes are hard to\nsolve with in-context learning alone, as language models (LMs) might lack prior\nknowledge about the precise classes or how to assign them, and it is generally\ninfeasible to demonstrate every class in a prompt. We propose a general\nprogram, $\\texttt{Infer--Retrieve--Rank}$, that defines multi-step interactions\nbetween LMs and retrievers to efficiently tackle such problems. We implement\nthis program using the $\\texttt{DSPy}$ programming model, which specifies\nin-context systems in a declarative manner, and use $\\texttt{DSPy}$ optimizers\nto tune it towards specific datasets by bootstrapping only tens of few-shot\nexamples. Our primary extreme classification program, optimized separately for\neach task, attains state-of-the-art results across three benchmarks (HOUSE,\nTECH, TECHWOLF). We apply the same program to a benchmark with vastly different\ncharacteristics and attain competitive performance as well (BioDEX). 
Unlike\nprior work, our proposed solution requires no finetuning, is easily applicable\nto new tasks, alleviates prompt engineering, and requires only tens of labeled\nexamples. Our code is public at https://github.com/KarelDO/xmc.dspy.\n","authors":["Karel D'Oosterlinck","Omar Khattab","François Remy","Thomas Demeester","Chris Develder","Christopher Potts"],"pdf_url":"https://arxiv.org/pdf/2401.12178v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12168v1","updated":"2024-01-22T18:01:01Z","published":"2024-01-22T18:01:01Z","title":"SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning\n Capabilities","summary":" Understanding and reasoning about spatial relationships is a fundamental\ncapability for Visual Question Answering (VQA) and robotics. While Vision\nLanguage Models (VLM) have demonstrated remarkable performance in certain VQA\nbenchmarks, they still lack capabilities in 3D spatial reasoning, such as\nrecognizing quantitative relationships of physical objects like distances or\nsize differences. We hypothesize that VLMs' limited spatial reasoning\ncapability is due to the lack of 3D spatial knowledge in training data and aim\nto solve this problem by training VLMs with Internet-scale spatial reasoning\ndata. To this end, we present a system to facilitate this approach. We first\ndevelop an automatic 3D spatial VQA data generation framework that scales up to\n2 billion VQA examples on 10 million real-world images. We then investigate\nvarious factors in the training recipe, including data quality, training\npipeline, and VLM architecture. Our work features the first internet-scale 3D\nspatial reasoning dataset in metric space. By training a VLM on such data, we\nsignificantly enhance its ability on both qualitative and quantitative spatial\nVQA. Finally, we demonstrate that this VLM unlocks novel downstream\napplications in chain-of-thought spatial reasoning and robotics due to its\nquantitative estimation capability. Project website:\nhttps://spatial-vlm.github.io/\n","authors":["Boyuan Chen","Zhuo Xu","Sean Kirmani","Brian Ichter","Danny Driess","Pete Florence","Dorsa Sadigh","Leonidas Guibas","Fei Xia"],"pdf_url":"https://arxiv.org/pdf/2401.12168v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12143v1","updated":"2024-01-22T17:26:55Z","published":"2024-01-22T17:26:55Z","title":"Anisotropy Is Inherent to Self-Attention in Transformers","summary":" The representation degeneration problem is a phenomenon that is widely\nobserved among self-supervised learning methods based on Transformers. In NLP,\nit takes the form of anisotropy, a singular property of hidden representations\nwhich makes them unexpectedly close to each other in terms of angular distance\n(cosine-similarity). Some recent works tend to show that anisotropy is a\nconsequence of optimizing the cross-entropy loss on long-tailed distributions\nof tokens. We show in this paper that anisotropy can also be observed\nempirically in language models with specific objectives that should not suffer\ndirectly from the same consequences. We also show that the anisotropy problem\nextends to Transformers trained on other modalities. Our observations suggest\nthat anisotropy is actually inherent to Transformers-based models.\n","authors":["Nathan Godey","Éric de la Clergerie","Benoît Sagot"],"pdf_url":"https://arxiv.org/pdf/2401.12143v1.pdf","comment":"Proceedings of EACL 2024. Previously presented at ACL-SRW 2023\n (arXiv:2306.07656). 
arXiv admin note: substantial text overlap with\n arXiv:2306.07656"},{"id":"http://arxiv.org/abs/2401.10491v2","updated":"2024-01-22T17:16:37Z","published":"2024-01-19T05:02:46Z","title":"Knowledge Fusion of Large Language Models","summary":" While training large language models (LLMs) from scratch can generate models\nwith distinct functionalities and strengths, it comes at significant costs and\nmay result in redundant capabilities. Alternatively, a cost-effective and\ncompelling approach is to merge existing pre-trained LLMs into a more potent\nmodel. However, due to the varying architectures of these LLMs, directly\nblending their weights is impractical. In this paper, we introduce the notion\nof knowledge fusion for LLMs, aimed at combining the capabilities of existing\nLLMs and transferring them into a single LLM. By leveraging the generative\ndistributions of source LLMs, we externalize their collective knowledge and\nunique strengths, thereby potentially elevating the capabilities of the target\nmodel beyond those of any individual source LLM. We validate our approach using\nthree popular LLMs with different architectures--Llama-2, MPT, and\nOpenLLaMA--across various benchmarks and tasks. Our findings confirm that the\nfusion of LLMs can improve the performance of the target model across a range\nof capabilities such as reasoning, commonsense, and code generation. Our code,\nmodel weights, and data are public at\n\\url{https://github.com/fanqiwan/FuseLLM}.\n","authors":["Fanqi Wan","Xinting Huang","Deng Cai","Xiaojun Quan","Wei Bi","Shuming Shi"],"pdf_url":"https://arxiv.org/pdf/2401.10491v2.pdf","comment":"Accepted to ICLR 2024"},{"id":"http://arxiv.org/abs/2304.14317v2","updated":"2024-01-22T17:06:50Z","published":"2023-04-27T16:38:17Z","title":"ICE-Score: Instructing Large Language Models to Evaluate Code","summary":" Recent advancements in the field of natural language generation have\nfacilitated the use of large language models to assess the quality of generated\ntext. Although these models have shown promising results in tasks such as\nmachine translation and summarization, their applicability in code intelligence\ntasks remains limited without human involvement. The complexity of programming\nconcepts required for such tasks makes it difficult to develop evaluation\nmetrics that align with human judgment. Token-matching-based metrics, such as\nBLEU, have demonstrated weak correlations with human practitioners in code\nintelligence tasks. Moreover, utilizing human-written test suites to evaluate\nfunctional correctness can be challenging in domains with low resources. To\novercome these obstacles, we propose \\texttt{ICE-Score}, a new evaluation\nmetric via instructing large language models (LLMs) for code assessments. Our\nmetric addresses the limitations of existing approaches by achieving superior\ncorrelations with functional correctness and human preferences, without the\nneed for test oracles or references. We evaluate the efficacy of our metric on\ntwo different aspects (\\textit{human preference} and \\textit{execution\nsuccess}) and four programming languages. Our results demonstrate that our\nmetric surpasses state-of-the-art metrics for code generation, delivering high\nlevels of accuracy and consistency across various programming languages and\ntasks. 
We also make our evaluation metric and datasets available to the\npublic\\footnote{\\url{https://github.com/terryyz/ice-score}}, encouraging\nfurther research in evaluating code intelligence tasks.\n","authors":["Terry Yue Zhuo"],"pdf_url":"https://arxiv.org/pdf/2304.14317v2.pdf","comment":"Accepted to Findings of EACL 2024"},{"id":"http://arxiv.org/abs/2401.12117v1","updated":"2024-01-22T16:57:05Z","published":"2024-01-22T16:57:05Z","title":"The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large\n Language Models","summary":" While large language models (LLMs) are still being adopted to new domains and\nutilized in novel applications, we are experiencing an influx of the new\ngeneration of foundation models, namely multi-modal large language models\n(MLLMs). These models integrate verbal and visual information, opening new\npossibilities to demonstrate more complex reasoning abilities at the\nintersection of the two modalities. However, despite the revolutionizing\nprospect of MLLMs, our understanding of their reasoning abilities is limited.\nIn this study, we assess the nonverbal abstract reasoning abilities of\nopen-source and closed-source MLLMs using variations of Raven's Progressive\nMatrices. Our experiments expose the difficulty of solving such problems while\nshowcasing the immense gap between open-source and closed-source models. We\nalso reveal critical shortcomings with individual visual and textual modules,\nsubjecting the models to low-performance ceilings. Finally, to improve MLLMs'\nperformance, we experiment with various methods, such as Chain-of-Thought\nprompting, resulting in a significant (up to 100%) boost in performance.\n","authors":["Kian Ahrabian","Zhivar Sourati","Kexuan Sun","Jiarui Zhang","Yifan Jiang","Fred Morstatter","Jay Pujara"],"pdf_url":"https://arxiv.org/pdf/2401.12117v1.pdf","comment":"Code and datasets are available at\n https://github.com/kahrabian/mllm-nvar"},{"id":"http://arxiv.org/abs/2401.12097v1","updated":"2024-01-22T16:35:00Z","published":"2024-01-22T16:35:00Z","title":"An Empirical Analysis of In-context Learning Abilities of LLMs for MT","summary":" In-context learning (ICL) has consistently demonstrated superior performance\nover zero-shot performance in large language models (LLMs). However, the\nunderstanding of the dynamics of ICL and the aspects that influence downstream\nperformance remains limited, especially for natural language generation (NLG)\ntasks. This work aims to address this gap by investigating the ICL capabilities\nof LLMs and studying the impact of different aspects of the in-context\ndemonstrations for the task of machine translation (MT). Our preliminary\ninvestigations aim to discern whether in-context learning (ICL) is\npredominantly influenced by demonstrations or instructions by applying diverse\nperturbations to in-context demonstrations while preserving the task\ninstruction. We observe varying behavior to perturbed examples across different\nmodel families, notably with BLOOM-7B derivatives being severely influenced by\nnoise, whereas Llama 2 derivatives not only exhibit robustness but also tend to\nshow enhancements over the clean baseline when subject to perturbed\ndemonstrations. This suggests that the robustness of ICL may be governed by\nseveral factors, including the type of noise, perturbation direction (source or\ntarget), the extent of pretraining of the specific model, and fine-tuning for\ndownstream tasks if applicable. 
Further investigation is warranted to develop a\ncomprehensive understanding of these factors in future research.\n","authors":["Pranjal A. Chitale","Jay Gala","Varun Gumma","Mitesh M. Khapra","Raj Dabre"],"pdf_url":"https://arxiv.org/pdf/2401.12097v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2401.12088v1","updated":"2024-01-22T16:25:47Z","published":"2024-01-22T16:25:47Z","title":"Unsupervised Learning of Graph from Recipes","summary":" Cooking recipes are one of the most readily available kinds of procedural\ntext. They consist of natural language instructions that can be challenging to\ninterpret. In this paper, we propose a model to identify relevant information\nfrom recipes and generate a graph to represent the sequence of actions in the\nrecipe. In contrast with other approaches, we use an unsupervised approach. We\niteratively learn the graph structure and the parameters of a $\\mathsf{GNN}$\nencoding the texts (text-to-graph) one sequence at a time while providing the\nsupervision by decoding the graph into text (graph-to-text) and comparing the\ngenerated text to the input. We evaluate the approach by comparing the\nidentified entities with annotated datasets, comparing the difference between\nthe input and output texts, and comparing our generated graphs with those\ngenerated by state of the art methods.\n","authors":["Aissatou Diallo","Antonis Bikakis","Luke Dickens","Anthony Hunter","Rob Miller"],"pdf_url":"https://arxiv.org/pdf/2401.12088v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12087v1","updated":"2024-01-22T16:25:27Z","published":"2024-01-22T16:25:27Z","title":"Revisiting Demonstration Selection Strategies in In-Context Learning","summary":" Large language models (LLMs) have shown an impressive ability to perform a\nwide range of tasks using in-context learning (ICL), where a few examples are\nused to describe a task to the model. However, the performance of ICL varies\nsignificantly with the choice of demonstrations, and it is still unclear why\nthis happens or what factors will influence its choice. In this work, we first\nrevisit the factors contributing to this variance from both data and model\naspects, and find that the choice of demonstration is both data- and\nmodel-dependent. We further proposed a data- and model-dependent demonstration\nselection method, \\textbf{TopK + ConE}, based on the assumption that\n\\textit{the performance of a demonstration positively correlates with its\ncontribution to the model's understanding of the test samples}, resulting in a\nsimple and effective recipe for ICL. Empirically, our method yields consistent\nimprovements in both language understanding and generation tasks with different\nmodel scales. Further analyses confirm that, besides the generality and\nstability under different circumstances, our method provides a unified\nexplanation for the effectiveness of previous methods. Code will be released.\n","authors":["Keqin Peng","Liang Ding","Yancheng Yuan","Xuebo Liu","Min Zhang","Yuanxin Ouyang","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2401.12087v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12086v1","updated":"2024-01-22T16:24:43Z","published":"2024-01-22T16:24:43Z","title":"West-of-N: Synthetic Preference Generation for Improved Reward Modeling","summary":" The success of reinforcement learning from human feedback (RLHF) in language\nmodel alignment is strongly dependent on the quality of the underlying reward\nmodel. 
In this paper, we present a novel approach to improve reward model\nquality by generating synthetic preference data, thereby augmenting the\ntraining dataset with on-policy, high-quality preference pairs. Motivated by\nthe promising results of Best-of-N sampling strategies in language model\ntraining, we extend their application to reward model training. This results in\na self-training strategy to generate preference pairs by selecting the best and\nworst candidates in a pool of responses to a given query. Empirically, we find\nthat this approach improves the performance of any reward model, with an effect\ncomparable to the addition of a similar quantity of human preference data. This\nwork opens up new avenues of research for improving RLHF for language model\nalignment, by offering synthetic preference generation as a solution to reward\nmodeling challenges.\n","authors":["Alizée Pace","Jonathan Mallinson","Eric Malmi","Sebastian Krause","Aliaksei Severyn"],"pdf_url":"https://arxiv.org/pdf/2401.12086v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12078v1","updated":"2024-01-22T16:20:14Z","published":"2024-01-22T16:20:14Z","title":"Temporal Blind Spots in Large Language Models","summary":" Large language models (LLMs) have recently gained significant attention due\nto their unparalleled ability to perform various natural language processing\ntasks. These models, benefiting from their advanced natural language\nunderstanding capabilities, have demonstrated impressive zero-shot performance.\nHowever, the pre-training data utilized in LLMs is often confined to a specific\ncorpus, resulting in inherent freshness and temporal scope limitations.\nConsequently, this raises concerns regarding the effectiveness of LLMs for\ntasks involving temporal intents. In this study, we aim to investigate the\nunderlying limitations of general-purpose LLMs when deployed for tasks that\nrequire a temporal understanding. We pay particular attention to handling\nfactual temporal knowledge through three popular temporal QA datasets.\nSpecifically, we observe low performance on detailed questions about the past\nand, surprisingly, for rather new information. In manual and automatic testing,\nwe find multiple temporal errors and characterize the conditions under which QA\nperformance deteriorates. Our analysis contributes to understanding LLM\nlimitations and offers valuable insights into developing future models that can\nbetter cater to the demands of temporally-oriented tasks. The code is\navailable\\footnote{https://github.com/jwallat/temporalblindspots}.\n","authors":["Jonas Wallat","Adam Jatowt","Avishek Anand"],"pdf_url":"https://arxiv.org/pdf/2401.12078v1.pdf","comment":"accepted at WSDM'24"},{"id":"http://arxiv.org/abs/2401.12072v1","updated":"2024-01-22T16:13:45Z","published":"2024-01-22T16:13:45Z","title":"Cross-lingual Transfer Learning for Javanese Dependency Parsing","summary":" While structure learning achieves remarkable performance in high-resource\nlanguages, the situation differs for under-represented languages due to the\nscarcity of annotated data. This study focuses on assessing the efficacy of\ntransfer learning in enhancing dependency parsing for Javanese, a language\nspoken by 80 million individuals but characterized by limited representation in\nnatural language processing. We utilized the Universal Dependencies dataset\nconsisting of dependency treebanks from more than 100 languages, including\nJavanese. 
We propose two learning strategies to train the model: transfer\nlearning (TL) and hierarchical transfer learning (HTL). While TL only uses a\nsource language to pre-train the model, the HTL method uses a source language\nand an intermediate language in the learning process. The results show that our\nbest model uses the HTL method, which improves performance with an increase of\n10% for both UAS and LAS evaluations compared to the baseline model.\n","authors":["Fadli Aulawi Al Ghiffari","Ika Alfina","Kurniawati Azizah"],"pdf_url":"https://arxiv.org/pdf/2401.12072v1.pdf","comment":"Accepted at IJCNLP-AACL 2023 SRW"},{"id":"http://arxiv.org/abs/2401.12070v1","updated":"2024-01-22T16:09:47Z","published":"2024-01-22T16:09:47Z","title":"Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated\n Text","summary":" Detecting text generated by modern large language models is thought to be\nhard, as both LLMs and humans can exhibit a wide range of complex behaviors.\nHowever, we find that a score based on contrasting two closely related language\nmodels is highly accurate at separating human-generated and machine-generated\ntext. Based on this mechanism, we propose a novel LLM detector that only\nrequires simple calculations using a pair of pre-trained LLMs. The method,\ncalled Binoculars, achieves state-of-the-art accuracy without any training\ndata. It is capable of spotting machine text from a range of modern LLMs\nwithout any model-specific modifications. We comprehensively evaluate\nBinoculars on a number of text sources and in varied situations. Over a wide\nrange of document types, Binoculars detects over 90% of generated samples from\nChatGPT (and other LLMs) at a false positive rate of 0.01%, despite not being\ntrained on any ChatGPT data.\n","authors":["Abhimanyu Hans","Avi Schwarzschild","Valeriia Cherepanova","Hamid Kazemi","Aniruddha Saha","Micah Goldblum","Jonas Geiping","Tom Goldstein"],"pdf_url":"https://arxiv.org/pdf/2401.12070v1.pdf","comment":"20 pages, code available at https://github.com/ahans30/Binoculars"},{"id":"http://arxiv.org/abs/2311.14212v3","updated":"2024-01-22T15:05:30Z","published":"2023-11-23T21:54:22Z","title":"Annotation Sensitivity: Training Data Collection Methods Affect Model\n Performance","summary":" When training data are collected from human annotators, the design of the\nannotation instrument, the instructions given to annotators, the\ncharacteristics of the annotators, and their interactions can impact training\ndata. This study demonstrates that design choices made when creating an\nannotation instrument also impact the models trained on the resulting\nannotations. We introduce the term annotation sensitivity to refer to the\nimpact of annotation data collection methods on the annotations themselves and\non downstream model performance and predictions. We collect annotations of hate\nspeech and offensive language in five experimental conditions of an annotation\ninstrument, randomly assigning annotators to conditions. We then fine-tune BERT\nmodels on each of the five resulting datasets and evaluate model performance on\na holdout portion of each condition. We find considerable differences between\nthe conditions for 1) the share of hate speech/offensive language annotations,\n2) model performance, 3) model predictions, and 4) model learning curves. Our\nresults emphasize the crucial role played by the annotation instrument which\nhas received little attention in the machine learning literature. 
We call for\nadditional research into how and why the instrument impacts the annotations to\ninform the development of best practices in instrument design.\n","authors":["Christoph Kern","Stephanie Eckman","Jacob Beck","Rob Chew","Bolei Ma","Frauke Kreuter"],"pdf_url":"https://arxiv.org/pdf/2311.14212v3.pdf","comment":"EMNLP 2023 Findings:\n https://aclanthology.org/2023.findings-emnlp.992/"},{"id":"http://arxiv.org/abs/2306.00824v2","updated":"2024-01-22T14:57:47Z","published":"2023-06-01T15:46:36Z","title":"Zero and Few-shot Semantic Parsing with Ambiguous Inputs","summary":" Despite the frequent challenges posed by ambiguity when representing meaning\nvia natural language, it is often ignored or deliberately removed in tasks\nmapping language to formally-designed representations, which generally assume a\none-to-one mapping between linguistic and formal representations. We attempt to\naddress this shortcoming by introducing AmP, a framework, dataset, and\nchallenge for translating ambiguous natural language to formal representations\nlike logic and code. We define templates and generate data for five\nwell-documented linguistic ambiguities. Using AmP, we investigate how several\nfew-shot text-to-code systems handle ambiguity, introducing three new metrics.\nWe find that large pre-trained models perform poorly at capturing the\ndistribution of possible meanings without deliberate instruction. However,\nmodels are able to capture the distribution well when ambiguity is attested in\ntheir inputs. These results motivate a call for including ambiguity explicitly\nin datasets and promote considering the distribution of possible outputs when\nevaluating systems. Data and code: https://github.com/esteng/ambiguous_parsing\n","authors":["Elias Stengel-Eskin","Kyle Rawlins","Benjamin Van Durme"],"pdf_url":"https://arxiv.org/pdf/2306.00824v2.pdf","comment":"ICLR 2024 Camera Ready"},{"id":"http://arxiv.org/abs/2401.12005v1","updated":"2024-01-22T14:53:59Z","published":"2024-01-22T14:53:59Z","title":"ALMs: Authorial Language Models for Authorship Attribution","summary":" In this paper, we introduce an authorship attribution method called Authorial\nLanguage Models (ALMs) that involves identifying the most likely author of a\nquestioned document based on the perplexity of the questioned document\ncalculated for a set of causal language models fine-tuned on the writings of a\nset of candidate author. We benchmarked ALMs against state-of-art-systems using\nthe CCAT50 dataset and the Blogs50 datasets. We find that ALMs achieves a\nmacro-average accuracy score of 83.6% on Blogs50, outperforming all other\nmethods, and 74.9% on CCAT50, matching the performance of the best method. To\nassess the performance of ALMs on shorter texts, we also conducted text\nablation testing. We found that to reach a macro-average accuracy of 70%, ALMs\nneeds 40 tokens on Blogs50 and 400 tokens on CCAT50, while to reach 60% ALMs\nrequires 20 tokens on Blogs50 and 70 tokens on CCAT50.\n","authors":["Weihang Huang","Akira Murakami","Jack Grieve"],"pdf_url":"https://arxiv.org/pdf/2401.12005v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02118v2","updated":"2024-01-22T14:41:43Z","published":"2023-10-03T14:59:35Z","title":"TWIZ-v2: The Wizard of Multimodal Conversational-Stimulus","summary":" In this report, we describe the vision, challenges, and scientific\ncontributions of the Task Wizard team, TWIZ, in the Alexa Prize TaskBot\nChallenge 2022. 
Our vision, is to build TWIZ bot as an helpful, multimodal,\nknowledgeable, and engaging assistant that can guide users towards the\nsuccessful completion of complex manual tasks. To achieve this, we focus our\nefforts on three main research questions: (1) Humanly-Shaped Conversations, by\nproviding information in a knowledgeable way; (2) Multimodal Stimulus, making\nuse of various modalities including voice, images, and videos; and (3)\nZero-shot Conversational Flows, to improve the robustness of the interaction to\nunseen scenarios. TWIZ is an assistant capable of supporting a wide range of\ntasks, with several innovative features such as creative cooking, video\nnavigation through voice, and the robust TWIZ-LLM, a Large Language Model\ntrained for dialoguing about complex manual tasks. Given ratings and feedback\nprovided by users, we observed that TWIZ bot is an effective and robust system,\ncapable of guiding users through tasks while providing several multimodal\nstimuli.\n","authors":["Rafael Ferreira","Diogo Tavares","Diogo Silva","Rodrigo Valério","João Bordalo","Inês Simões","Vasco Ramos","David Semedo","João Magalhães"],"pdf_url":"https://arxiv.org/pdf/2310.02118v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11972v1","updated":"2024-01-22T14:24:03Z","published":"2024-01-22T14:24:03Z","title":"Synergizing Machine Learning & Symbolic Methods: A Survey on Hybrid\n Approaches to Natural Language Processing","summary":" The advancement of machine learning and symbolic approaches have underscored\ntheir strengths and weaknesses in Natural Language Processing (NLP). While\nmachine learning approaches are powerful in identifying patterns in data, they\noften fall short in learning commonsense and the factual knowledge required for\nthe NLP tasks. Meanwhile, the symbolic methods excel in representing\nknowledge-rich data. However, they struggle to adapt dynamic data and\ngeneralize the knowledge. Bridging these two paradigms through hybrid\napproaches enables the alleviation of weaknesses in both while preserving their\nstrengths. Recent studies extol the virtues of this union, showcasing promising\nresults in a wide range of NLP tasks. In this paper, we present an overview of\nhybrid approaches used for NLP. Specifically, we delve into the\nstate-of-the-art hybrid approaches used for a broad spectrum of NLP tasks\nrequiring natural language understanding, generation, and reasoning.\nFurthermore, we discuss the existing resources available for hybrid approaches\nfor NLP along with the challenges, offering a roadmap for future directions.\n","authors":["Rrubaa Panchendrarajan","Arkaitz Zubiaga"],"pdf_url":"https://arxiv.org/pdf/2401.11972v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11969v1","updated":"2024-01-22T14:17:03Z","published":"2024-01-22T14:17:03Z","title":"Claim Detection for Automated Fact-checking: A Survey on Monolingual,\n Multilingual and Cross-Lingual Research","summary":" Automated fact-checking has drawn considerable attention over the past few\ndecades due to the increase in the diffusion of misinformation on online\nplatforms. This is often carried out as a sequence of tasks comprising (i) the\ndetection of sentences circulating in online platforms which constitute claims\nneeding verification, followed by (ii) the verification process of those\nclaims. This survey focuses on the former, by discussing existing efforts\ntowards detecting claims needing fact-checking, with a particular focus on\nmultilingual data and methods. 
This is a challenging and fertile direction\nwhere existing methods are yet far from matching human performance due to the\nprofoundly challenging nature of the issue. Especially, the dissemination of\ninformation across multiple social platforms, articulated in multiple languages\nand modalities demands more generalized solutions for combating misinformation.\nFocusing on multilingual misinformation, we present a comprehensive survey of\nexisting multilingual claim detection research. We present state-of-the-art\nmultilingual claim detection research categorized into three key factors of the\nproblem, verifiability, priority, and similarity. Further, we present a\ndetailed overview of the existing multilingual datasets along with the\nchallenges and suggest possible future advancements.\n","authors":["Rrubaa Panchendrarajan","Arkaitz Zubiaga"],"pdf_url":"https://arxiv.org/pdf/2401.11969v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14578v2","updated":"2024-01-22T14:13:51Z","published":"2023-05-23T23:31:24Z","title":"Connecting the Dots: What Graph-Based Text Representations Work Best for\n Text Classification Using Graph Neural Networks?","summary":" Given the success of Graph Neural Networks (GNNs) for structure-aware machine\nlearning, many studies have explored their use for text classification, but\nmostly in specific domains with limited data characteristics. Moreover, some\nstrategies prior to GNNs relied on graph mining and classical machine learning,\nmaking it difficult to assess their effectiveness in modern settings. This work\nextensively investigates graph representation methods for text classification,\nidentifying practical implications and open challenges. We compare different\ngraph construction schemes using a variety of GNN architectures and setups\nacross five datasets, encompassing short and long documents as well as\nunbalanced scenarios in diverse domains. Two Transformer-based large language\nmodels are also included to complement the study. The results show that i)\nalthough the effectiveness of graphs depends on the textual input features and\ndomain, simple graph constructions perform better the longer the documents are,\nii) graph representations are especially beneficial for longer documents,\noutperforming Transformer-based models, iii) graph methods are particularly\nefficient at solving the task.\n","authors":["Margarita Bugueño","Gerard de Melo"],"pdf_url":"https://arxiv.org/pdf/2305.14578v2.pdf","comment":"Accepted to Findings of the Association for Computational\n Linguistics: EMNLP 2023 (Long Paper). 17 pages, 2 figures, 15 tables. The\n Appendix starts on page 12"},{"id":"http://arxiv.org/abs/2310.01386v2","updated":"2024-01-22T13:58:50Z","published":"2023-10-02T17:46:09Z","title":"Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using\n PsychoBench","summary":" Large Language Models (LLMs) have recently showcased their remarkable\ncapacities, not only in natural language processing tasks but also across\ndiverse domains such as clinical medicine, legal consultation, and education.\nLLMs become more than mere applications, evolving into assistants capable of\naddressing diverse user requests. This narrows the distinction between human\nbeings and artificial intelligence agents, raising intriguing questions\nregarding the potential manifestation of personalities, temperaments, and\nemotions within LLMs. In this paper, we propose a framework, PsychoBench, for\nevaluating diverse psychological aspects of LLMs. 
Comprising thirteen scales\ncommonly used in clinical psychology, PsychoBench further classifies these\nscales into four distinct categories: personality traits, interpersonal\nrelationships, motivational tests, and emotional abilities. Our study examines\nfive popular models, namely text-davinci-003, gpt-3.5-turbo, gpt-4, LLaMA-2-7b,\nand LLaMA-2-13b. Additionally, we employ a jailbreak approach to bypass the\nsafety alignment protocols and test the intrinsic natures of LLMs. We have made\nPsychoBench openly accessible via https://github.com/CUHK-ARISE/PsychoBench.\n","authors":["Jen-tse Huang","Wenxuan Wang","Eric John Li","Man Ho Lam","Shujie Ren","Youliang Yuan","Wenxiang Jiao","Zhaopeng Tu","Michael R. Lyu"],"pdf_url":"https://arxiv.org/pdf/2310.01386v2.pdf","comment":"Accepted for ICLR 2024 Oral Presentation. 15 pages (main text) and 5\n pages (appendix)"},{"id":"http://arxiv.org/abs/2401.11944v1","updated":"2024-01-22T13:34:34Z","published":"2024-01-22T13:34:34Z","title":"CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding\n Benchmark","summary":" As the capabilities of large multimodal models (LMMs) continue to advance,\nevaluating the performance of LMMs emerges as an increasing need. Additionally,\nthere is an even larger gap in evaluating the advanced knowledge and reasoning\nabilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU,\na new Chinese Massive Multi-discipline Multimodal Understanding benchmark\ndesigned to evaluate LMMs on tasks demanding college-level subject knowledge\nand deliberate reasoning in a Chinese context. CMMMU is inspired by and\nstrictly follows the annotation and analysis pattern of MMMU.\n CMMMU includes 12k manually collected multimodal questions from college\nexams, quizzes, and textbooks, covering six core disciplines: Art & Design,\nBusiness, Science, Health & Medicine, Humanities & Social Science, and Tech &\nEngineering, like its companion, MMMU. These questions span 30 subjects and\ncomprise 39 highly heterogeneous image types, such as charts, diagrams, maps,\ntables, music sheets, and chemical structures.\n CMMMU focuses on complex perception and reasoning with domain-specific\nknowledge in the Chinese context. We evaluate 11 open-source LLMs and one\nproprietary GPT-4V(ision). Even GPT-4V only achieves accuracies of 42%,\nindicating a large space for improvement. CMMMU will boost the community to\nbuild the next-generation LMMs towards expert artificial intelligence and\npromote the democratization of LMMs by providing diverse language contexts.\n","authors":["Ge Zhang","Xinrun Du","Bei Chen","Yiming Liang","Tongxu Luo","Tianyu Zheng","Kang Zhu","Yuyang Cheng","Chunpu Xu","Shuyue Guo","Haoran Zhang","Xingwei Qu","Junjie Wang","Ruibin Yuan","Yizhi Li","Zekun Wang","Yudong Liu","Yu-Hsuan Tsai","Fengji Zhang","Chenghua Lin","Wenhao Huang","Wenhu Chen","Jie Fu"],"pdf_url":"https://arxiv.org/pdf/2401.11944v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11943v1","updated":"2024-01-22T13:33:53Z","published":"2024-01-22T13:33:53Z","title":"Benchmarking Large Multimodal Models against Common Corruptions","summary":" This technical report aims to fill a deficiency in the assessment of large\nmultimodal models (LMMs) by specifically examining the self-consistency of\ntheir outputs when subjected to common corruptions. We investigate the\ncross-modal interactions between text, image, and speech, encompassing four\nessential generation tasks: text-to-image, image-to-text, text-to-speech, and\nspeech-to-text. 
We create a comprehensive benchmark, named MMCBench, that\ncovers more than 100 popular LMMs (totally over 150 model checkpoints). A\nthorough evaluation under common corruptions is critical for practical\ndeployment and facilitates a better understanding of the reliability of\ncutting-edge LMMs. The benchmarking code is available at\nhttps://github.com/sail-sg/MMCBench\n","authors":["Jiawei Zhang","Tianyu Pang","Chao Du","Yi Ren","Bo Li","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2401.11943v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2401.11911v1","updated":"2024-01-22T12:54:04Z","published":"2024-01-22T12:54:04Z","title":"Blinded by Generated Contexts: How Language Models Merge Generated and\n Retrieved Contexts for Open-Domain QA?","summary":" While auxiliary information has become a key to enhance Large Language Models\n(LLMs), relatively little is known about how well LLMs merge these contexts,\nspecifically generated and retrieved. To study this, we formulate a task\nspecifically designed to identify whether the answers, derived from the\nintegration of generated and retrieved contexts, are attributed to either\ngenerated or retrieved contexts. To support this task, we develop a methodology\nto construct datasets with conflicting contexts, where each question is paired\nwith both generated and retrieved contexts, yet only one of them contains the\ncorrect answer. Our experiments reveal a significant bias in LLMs towards\ngenerated contexts, as evidenced across state-of-the-art open (Llama2-7b/13b)\nand closed (GPT 3.5/4) systems. We further identify two key factors\ncontributing to this bias: i) Contexts generated by LLMs typically show greater\nsimilarity to the questions, increasing their likelihood of selection; ii) The\nsegmentation process used in retrieved contexts disrupts their completeness,\nthereby hindering their full utilization in LLMs. Our analysis enhances the\nunderstanding of how LLMs merge diverse contexts, offering valuable insights\nfor advancing current augmentation methods for LLMs.\n","authors":["Hexiang Tan","Fei Sun","Wanli Yang","Yuanzhuo Wang","Qi Cao","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2401.11911v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10337v2","updated":"2024-01-22T12:33:43Z","published":"2024-01-18T19:02:00Z","title":"Noise Contrastive Estimation-based Matching Framework for Low-resource\n Security Attack Pattern Recognition","summary":" Tactics, Techniques and Procedures (TTPs) represent sophisticated attack\npatterns in the cybersecurity domain, described encyclopedically in textual\nknowledge bases. Identifying TTPs in cybersecurity writing, often called TTP\nmapping, is an important and challenging task. Conventional learning approaches\noften target the problem in the classical multi-class or multilabel\nclassification setting. This setting hinders the learning ability of the model\ndue to a large number of classes (i.e., TTPs), the inevitable skewness of the\nlabel distribution and the complex hierarchical structure of the label space.\nWe formulate the problem in a different learning paradigm, where the assignment\nof a text to a TTP label is decided by the direct semantic similarity between\nthe two, thus reducing the complexity of competing solely over the large\nlabeling space. 
To that end, we propose a neural matching architecture with an\neffective sampling-based learn-to-compare mechanism, facilitating the learning\nprocess of the matching model despite constrained resources.\n","authors":["Tu Nguyen","Nedim Srndic","Alexander Neth"],"pdf_url":"https://arxiv.org/pdf/2401.10337v2.pdf","comment":"accepted at EACL 2024, in ARR October 2023"},{"id":"http://arxiv.org/abs/2311.07989v4","updated":"2024-01-22T12:27:47Z","published":"2023-11-14T08:34:26Z","title":"Unifying the Perspectives of NLP and Software Engineering: A Survey on\n Language Models for Code","summary":" In this work we systematically review the recent advancements in code\nprocessing with language models, covering 50+ models, 30+ evaluation tasks,\n170+ datasets, and 700+ related works. We break down code processing models\ninto general language models represented by the GPT family and specialized\nmodels that are specifically pretrained on code, often with tailored\nobjectives. We discuss the relations and differences between these models, and\nhighlight the historical transition of code modeling from statistical models\nand RNNs to pretrained Transformers and LLMs, which is exactly the same course\nthat had been taken by NLP. We also discuss code-specific features such as AST,\nCFG, and unit tests, along with their application in training code language\nmodels, and identify key challenges and potential future directions in this\ndomain. We keep the survey open and updated on GitHub at\nhttps://github.com/codefuse-ai/Awesome-Code-LLM.\n","authors":["Ziyin Zhang","Chaoyu Chen","Bingchang Liu","Cong Liao","Zi Gong","Hang Yu","Jianguo Li","Rui Wang"],"pdf_url":"https://arxiv.org/pdf/2311.07989v4.pdf","comment":"Repo is available at https://github.com/codefuse-ai/Awesome-Code-LLM.\n 8 figures, 10 tables, and 713 references"},{"id":"http://arxiv.org/abs/2401.11880v1","updated":"2024-01-22T12:11:55Z","published":"2024-01-22T12:11:55Z","title":"PsySafe: A Comprehensive Framework for Psychological-based Attack,\n Defense, and Evaluation of Multi-agent System Safety","summary":" Multi-agent systems, augmented with Large Language Models (LLMs), demonstrate\nsignificant capabilities for collective intelligence. However, the potential\nmisuse of this intelligence for malicious purposes presents significant risks.\nTo date, comprehensive research on the safety issues associated with\nmulti-agent systems remains limited. From the perspective of agent psychology,\nwe discover that the dark psychological states of agents can lead to severe\nsafety issues. To address these issues, we propose a comprehensive framework\ngrounded in agent psychology. In our framework, we focus on three aspects:\nidentifying how dark personality traits in agents might lead to risky\nbehaviors, designing defense strategies to mitigate these risks, and evaluating\nthe safety of multi-agent systems from both psychological and behavioral\nperspectives. Our experiments reveal several intriguing phenomena, such as the\ncollective dangerous behaviors among agents, agents' propensity for\nself-reflection when engaging in dangerous behavior, and the correlation\nbetween agents' psychological assessments and their dangerous behaviors. We\nanticipate that our framework and observations will provide valuable insights\nfor further research into the safety of multi-agent systems. 
We will make our\ndata and code publicly accessible at https:/github.com/AI4Good24/PsySafe.\n","authors":["Zaibin Zhang","Yongting Zhang","Lijun Li","Hongzhi Gao","Lijun Wang","Huchuan Lu","Feng Zhao","Yu Qiao","Jing Shao"],"pdf_url":"https://arxiv.org/pdf/2401.11880v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11864v1","updated":"2024-01-22T11:37:18Z","published":"2024-01-22T11:37:18Z","title":"Improving Small Language Models' Mathematical Reasoning via Mix Thoughts\n Distillation","summary":" This work addresses the challenge of democratizing advanced Large Language\nModels (LLMs) by compressing their mathematical reasoning capabilities into\nsub-billion parameter Small Language Models (SLMs) without compromising\nperformance. We introduce Equation-of-Thought Distillation (EoTD), a novel\ntechnique that encapsulates the reasoning process into equation-based\nrepresentations to construct an EoTD dataset for fine-tuning SLMs.\nAdditionally, we propose the Mix Thoughts Distillation (MTD) framework to\nenhance the reasoning performance of SLMs. This involves creating a reasoning\ndataset with multiple thought processes and using it for fine-tuning. Our\nexperimental findings demonstrate that EoTD significantly boosts the reasoning\nabilities of SLMs, while MTD enables these models to achieve state-of-the-art\nreasoning performance.\n","authors":["Xunyu Zhu","Jian Li","Yong Liu","Can Ma","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11864v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11852v1","updated":"2024-01-22T11:15:07Z","published":"2024-01-22T11:15:07Z","title":"The Right Model for the Job: An Evaluation of Legal Multi-Label\n Classification Baselines","summary":" Multi-Label Classification (MLC) is a common task in the legal domain, where\nmore than one label may be assigned to a legal document. A wide range of\nmethods can be applied, ranging from traditional ML approaches to the latest\nTransformer-based architectures. In this work, we perform an evaluation of\ndifferent MLC methods using two public legal datasets, POSTURE50K and\nEURLEX57K. By varying the amount of training data and the number of labels, we\nexplore the comparative advantage offered by different approaches in relation\nto the dataset properties. Our findings highlight DistilRoBERTa and LegalBERT\nas performing consistently well in legal MLC with reasonable computational\ndemands. T5 also demonstrates comparable performance while offering advantages\nas a generative model in the presence of changing label sets. Finally, we show\nthat the CrossEncoder exhibits potential for notable macro-F1 score\nimprovements, albeit with increased computational costs.\n","authors":["Martina Forster","Claudia Schulz","Prudhvi Nokku","Melicaalsadat Mirsafian","Jaykumar Kasundra","Stavroula Skylaki"],"pdf_url":"https://arxiv.org/pdf/2401.11852v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11839v1","updated":"2024-01-22T10:57:09Z","published":"2024-01-22T10:57:09Z","title":"AI for social science and social science of AI: A Survey","summary":" Recent advancements in artificial intelligence, particularly with the\nemergence of large language models (LLMs), have sparked a rethinking of\nartificial general intelligence possibilities. The increasing human-like\ncapabilities of AI are also attracting attention in social science research,\nleading to various studies exploring the combination of these two fields. 
In\nthis survey, we systematically categorize previous explorations in the\ncombination of AI and social science into two directions that share common\ntechnical approaches but differ in their research objectives. The first\ndirection is focused on AI for social science, where AI is utilized as a\npowerful tool to enhance various stages of social science research. While the\nsecond direction is the social science of AI, which examines AI agents as\nsocial entities with their human-like cognitive and linguistic capabilities. By\nconducting a thorough review, particularly on the substantial progress\nfacilitated by recent advancements in large language models, this paper\nintroduces a fresh perspective to reassess the relationship between AI and\nsocial science, provides a cohesive framework that allows researchers to\nunderstand the distinctions and connections between AI for social science and\nsocial science of AI, and also summarized state-of-art experiment simulation\nplatforms to facilitate research in these two directions. We believe that as AI\ntechnology continues to advance and intelligent agents find increasing\napplications in our daily lives, the significance of the combination of AI and\nsocial science will become even more prominent.\n","authors":["Ruoxi Xu","Yingfei Sun","Mengjie Ren","Shiguang Guo","Ruotong Pan","Hongyu Lin","Le Sun","Xianpei Han"],"pdf_url":"https://arxiv.org/pdf/2401.11839v1.pdf","comment":"Accepted by Information Processing and Management (IP&M)"},{"id":"http://arxiv.org/abs/2401.11819v1","updated":"2024-01-22T10:30:11Z","published":"2024-01-22T10:30:11Z","title":"SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in\n Chinese","summary":" We introduce SuperCLUE-Math6(SC-Math6), a new benchmark dataset to evaluate\nthe mathematical reasoning abilities of Chinese language models. SC-Math6 is\ndesigned as an upgraded Chinese version of the GSM8K dataset with enhanced\ndifficulty, diversity, and application scope. It consists of over 2000\nmathematical word problems requiring multi-step reasoning and providing natural\nlanguage solutions. We propose an innovative scheme to quantify the reasoning\ncapability of large models based on performance over problems with different\nreasoning steps. Experiments on 12 representative Chinese models demonstrate a\nclear stratification of reasoning levels, with top models like GPT-4 showing\nsuperior performance. SC-Math6 fills the gap in Chinese mathematical reasoning\nbenchmarks and provides a comprehensive testbed to advance the intelligence of\nChinese language models.\n","authors":["Liang Xu","Hang Xue","Lei Zhu","Kangkang Zhao"],"pdf_url":"https://arxiv.org/pdf/2401.11819v1.pdf","comment":"8 pages, 7 figures, 4 tables"},{"id":"http://arxiv.org/abs/2401.11817v1","updated":"2024-01-22T10:26:14Z","published":"2024-01-22T10:26:14Z","title":"Hallucination is Inevitable: An Innate Limitation of Large Language\n Models","summary":" Hallucination has been widely recognized to be a significant drawback for\nlarge language models (LLMs). There have been many works that attempt to reduce\nthe extent of hallucination. These efforts have mostly been empirical so far,\nwhich cannot answer the fundamental question whether it can be completely\neliminated. In this paper, we formalize the problem and show that it is\nimpossible to eliminate hallucination in LLMs. 
Specifically, we define a formal\nworld where hallucination is defined as inconsistencies between a computable\nLLM and a computable ground truth function. By employing results from learning\ntheory, we show that LLMs cannot learn all of the computable functions and will\ntherefore always hallucinate. Since the formal world is a part of the real\nworld which is much more complicated, hallucinations are also inevitable for\nreal world LLMs. Furthermore, for real world LLMs constrained by provable time\ncomplexity, we describe the hallucination-prone tasks and empirically validate\nour claims. Finally, using the formal world framework, we discuss the possible\nmechanisms and efficacies of existing hallucination mitigators as well as the\npractical implications on the safe deployment of LLMs.\n","authors":["Ziwei Xu","Sanjay Jain","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2401.11817v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11791v1","updated":"2024-01-22T09:41:05Z","published":"2024-01-22T09:41:05Z","title":"SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic\n Segmentation","summary":" Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation\nmodels using training image data with only image-level supervision. Since\nprecise pixel-level annotations are not accessible, existing methods typically\nfocus on producing pseudo masks for training segmentation models by refining\nCAM-like heatmaps. However, the produced heatmaps may only capture\ndiscriminative image regions of target object categories or the associated\nco-occurring backgrounds. To address the issues, we propose a Semantic Prompt\nLearning for WSSS (SemPLeS) framework, which learns to effectively prompt the\nCLIP space to enhance the semantic alignment between the segmented regions and\nthe target object categories. More specifically, we propose Contrastive Prompt\nLearning and Class-associated Semantic Refinement to learn the prompts that\nadequately describe and suppress the image backgrounds associated with each\ntarget object category. In this way, our proposed framework is able to perform\nbetter semantic matching between object regions and the associated text labels,\nresulting in desired pseudo masks for training the segmentation model. The\nproposed SemPLeS framework achieves SOTA performance on the standard WSSS\nbenchmarks, PASCAL VOC and MS COCO, and demonstrated interpretability with the\nsemantic visualization of our learned prompts. The codes will be released.\n","authors":["Ci-Siang Lin","Chien-Yi Wang","Yu-Chiang Frank Wang","Min-Hung Chen"],"pdf_url":"https://arxiv.org/pdf/2401.11791v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.04408v3","updated":"2024-01-22T07:40:02Z","published":"2023-07-10T08:15:40Z","title":"TIM: Teaching Large Language Models to Translate with Comparison","summary":" Open-sourced large language models (LLMs) have demonstrated remarkable\nefficacy in various tasks with instruction tuning. However, these models can\nsometimes struggle with tasks that require more specialized knowledge such as\ntranslation. One possible reason for such deficiency is that instruction tuning\naims to generate fluent and coherent text that continues from a given\ninstruction without being constrained by any task-specific requirements.\nMoreover, it can be more challenging for tuning smaller LLMs with lower-quality\ntraining data. To address this issue, we propose a novel framework using\nexamples in comparison to teach LLMs to learn translation. 
Our approach\ninvolves presenting the model with examples of correct and incorrect\ntranslations and using a preference loss to guide the model's learning. We\nevaluate our method on WMT2022 test sets and show that it outperforms existing\nmethods. Our findings offer a new perspective on fine-tuning LLMs for\ntranslation tasks and provide a promising solution for generating high-quality\ntranslations. Please refer to Github for more details:\nhttps://github.com/lemon0830/TIM.\n","authors":["Jiali Zeng","Fandong Meng","Yongjing Yin","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2307.04408v3.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2309.12247v2","updated":"2024-01-22T07:24:30Z","published":"2023-09-21T16:47:30Z","title":"Bad Actor, Good Advisor: Exploring the Role of Large Language Models in\n Fake News Detection","summary":" Detecting fake news requires both a delicate sense of diverse clues and a\nprofound understanding of the real-world background, which remains challenging\nfor detectors based on small language models (SLMs) due to their knowledge and\ncapability limitations. Recent advances in large language models (LLMs) have\nshown remarkable performance in various tasks, but whether and how LLMs could\nhelp with fake news detection remains underexplored. In this paper, we\ninvestigate the potential of LLMs in fake news detection. First, we conduct an\nempirical study and find that a sophisticated LLM such as GPT 3.5 could\ngenerally expose fake news and provide desirable multi-perspective rationales\nbut still underperforms the basic SLM, fine-tuned BERT. Our subsequent analysis\nattributes such a gap to the LLM's inability to select and integrate rationales\nproperly to conclude. Based on these findings, we propose that current LLMs may\nnot substitute fine-tuned SLMs in fake news detection but can be a good advisor\nfor SLMs by providing multi-perspective instructive rationales. To instantiate\nthis proposal, we design an adaptive rationale guidance network for fake news\ndetection (ARG), in which SLMs selectively acquire insights on news analysis\nfrom the LLMs' rationales. We further derive a rationale-free version of ARG by\ndistillation, namely ARG-D, which services cost-sensitive scenarios without\nquerying LLMs. Experiments on two real-world datasets demonstrate that ARG and\nARG-D outperform three types of baseline methods, including SLM-based,\nLLM-based, and combinations of small and large language models.\n","authors":["Beizhe Hu","Qiang Sheng","Juan Cao","Yuhui Shi","Yang Li","Danding Wang","Peng Qi"],"pdf_url":"https://arxiv.org/pdf/2309.12247v2.pdf","comment":"16 pages, 5 figures, and 9 tables. To appear at AAAI 2024"},{"id":"http://arxiv.org/abs/2401.11725v1","updated":"2024-01-22T07:07:06Z","published":"2024-01-22T07:07:06Z","title":"Speak It Out: Solving Symbol-Related Problems with Symbol-to-Language\n Conversion for Language Models","summary":" Symbols (or more broadly, non-natural language textual representations) such\nas numerical sequences, molecular formulas, and table delimiters widely exist,\nplaying important roles in various tasks such as abstract reasoning, chemical\nproperty prediction, and table question answering. 
Despite the impressive\nnatural language comprehension capabilities of large language models (LLMs),\ntheir reasoning abilities for symbols remain inadequate, which could attributed\nto the difference between symbol representations and general natural languages.\nWe propose symbol-to-language (S2L), a tuning-free method that enables large\nlanguage models to solve symbol-related problems with information expressed in\nnatural language. Specifically, S2L first converts the symbols involved to\nlanguage-based representations, which can be implemented by prompting LLMs or\nleveraging external tools, then these language-based representations are\nintegrated into the original problem via direct substitution or concatenation,\nserving as useful input information for LLMs. We evaluate the S2L method using\nboth API-based (GPT-4, ChatGPT) and open-source (OpenChat) models over eight\nsymbol-related tasks, ranging from symbol-only abstract reasoning to sentiment\nanalysis in social media. Experimental results show that S2L consistently leads\nto superior performance. For example, by employing S2L for GPT-4, there can be\naverage significant improvements of +21.9% and +9.5% for subtasks in 1D-ARC and\nDyck language, respectively. Codes and data are available at\nhttps://github.com/THUNLP-MT/symbol2language.\n","authors":["Yile Wang","Sijie Cheng","Zixin Sun","Peng Li","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2401.11725v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.09798v2","updated":"2024-01-22T06:22:55Z","published":"2024-01-18T08:36:54Z","title":"All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks","summary":" Large Language Models (LLMs) like ChatGPT face `jailbreak' challenges, where\nsafeguards are bypassed to produce ethically harmful prompts. This study\nproposes a simple black-box method to effectively generate jailbreak prompts,\novercoming the high complexity and computational costs associated with existing\nmethods. The proposed technique iteratively rewrites harmful prompts into\nnon-harmful expressions using the target LLM itself, based on the hypothesis\nthat LLMs can directly sample expressions that bypass safeguards. Demonstrated\nthrough experiments with ChatGPT (GPT-3.5 and GPT-4) and Gemini-Pro, this\nmethod achieved an attack success rate of over 80% within an average of 5\niterations and remained effective despite model updates. The generated\njailbreak prompts were naturally-worded and concise; moreover, they were\ndifficult-to-defend. These results indicate that creating effective jailbreak\nprompts is simpler than previously considered, suggesting that black-box\njailbreak attacks pose a more serious threat.\n","authors":["Kazuhiro Takemoto"],"pdf_url":"https://arxiv.org/pdf/2401.09798v2.pdf","comment":"12 pages, 4 figures, 2 tables"},{"id":"http://arxiv.org/abs/2401.11700v1","updated":"2024-01-22T05:46:11Z","published":"2024-01-22T05:46:11Z","title":"Keep Decoding Parallel with Effective Knowledge Distillation from\n Language Models to End-to-end Speech Recognisers","summary":" This study presents a novel approach for knowledge distillation (KD) from a\nBERT teacher model to an automatic speech recognition (ASR) model using\nintermediate layers. To distil the teacher's knowledge, we use an attention\ndecoder that learns from BERT's token probabilities. Our method shows that\nlanguage model (LM) information can be more effectively distilled into an ASR\nmodel using both the intermediate layers and the final layer. 
By using the\nintermediate layers as distillation target, we can more effectively distil LM\nknowledge into the lower network layers. Using our method, we achieve better\nrecognition accuracy than with shallow fusion of an external LM, allowing us to\nmaintain fast parallel decoding. Experiments on the LibriSpeech dataset\ndemonstrate the effectiveness of our approach in enhancing greedy decoding with\nconnectionist temporal classification (CTC).\n","authors":["Michael Hentschel","Yuta Nishikawa","Tatsuya Komatsu","Yusuke Fujita"],"pdf_url":"https://arxiv.org/pdf/2401.11700v1.pdf","comment":"Accepted at ICASSP 2024"},{"id":"http://arxiv.org/abs/2304.03047v3","updated":"2024-01-22T04:57:32Z","published":"2023-04-06T13:07:17Z","title":"ETPNav: Evolving Topological Planning for Vision-Language Navigation in\n Continuous Environments","summary":" Vision-language navigation is a task that requires an agent to follow\ninstructions to navigate in environments. It becomes increasingly crucial in\nthe field of embodied AI, with potential applications in autonomous navigation,\nsearch and rescue, and human-robot interaction. In this paper, we propose to\naddress a more practical yet challenging counterpart setting - vision-language\nnavigation in continuous environments (VLN-CE). To develop a robust VLN-CE\nagent, we propose a new navigation framework, ETPNav, which focuses on two\ncritical skills: 1) the capability to abstract environments and generate\nlong-range navigation plans, and 2) the ability of obstacle-avoiding control in\ncontinuous environments. ETPNav performs online topological mapping of\nenvironments by self-organizing predicted waypoints along a traversed path,\nwithout prior environmental experience. It privileges the agent to break down\nthe navigation procedure into high-level planning and low-level control.\nConcurrently, ETPNav utilizes a transformer-based cross-modal planner to\ngenerate navigation plans based on topological maps and instructions. The plan\nis then performed through an obstacle-avoiding controller that leverages a\ntrial-and-error heuristic to prevent navigation from getting stuck in\nobstacles. Experimental results demonstrate the effectiveness of the proposed\nmethod. ETPNav yields more than 10% and 20% improvements over prior\nstate-of-the-art on R2R-CE and RxR-CE datasets, respectively. Our code is\navailable at https://github.com/MarSaKi/ETPNav.\n","authors":["Dong An","Hanqing Wang","Wenguan Wang","Zun Wang","Yan Huang","Keji He","Liang Wang"],"pdf_url":"https://arxiv.org/pdf/2304.03047v3.pdf","comment":"Project page: https://github.com/MarSaKi/ETPNav"},{"id":"http://arxiv.org/abs/2305.05352v6","updated":"2024-01-22T04:15:13Z","published":"2023-05-09T11:37:16Z","title":"A Taxonomy of Foundation Model based Systems through the Lens of\n Software Architecture","summary":" The recent release of large language model (LLM) based chatbots, such as\nChatGPT, has attracted huge interest in foundation models. It is widely\nbelieved that foundation models will serve as the fundamental building blocks\nfor future AI systems. As foundation models are in their early stages, the\ndesign of foundation model based systems has not yet been systematically\nexplored. There is limited understanding about the impact of introducing\nfoundation models in software architecture. 
Therefore, in this paper, we\npropose a taxonomy of foundation model based systems, which classifies and\ncompares the characteristics of foundation models and design options of\nfoundation model based systems. Our taxonomy comprises three categories: the\npretraining and adaptation of foundation models, the architecture design of\nfoundation model based systems, and responsible-AI-by-design. This taxonomy can\nserve as concrete guidance for making major architectural design decisions when\ndesigning foundation model based systems and highlights trade-offs arising from\ndesign decisions.\n","authors":["Qinghua Lu","Liming Zhu","Xiwei Xu","Yue Liu","Zhenchang Xing","Jon Whittle"],"pdf_url":"https://arxiv.org/pdf/2305.05352v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.01538v3","updated":"2024-01-22T02:39:17Z","published":"2023-09-04T11:38:02Z","title":"ChatRule: Mining Logical Rules with Large Language Models for Knowledge\n Graph Reasoning","summary":" Logical rules are essential for uncovering the logical connections between\nrelations, which could improve reasoning performance and provide interpretable\nresults on knowledge graphs (KGs). Although there have been many efforts to\nmine meaningful logical rules over KGs, existing methods suffer from\ncomputationally intensive searches over the rule space and a lack of\nscalability for large-scale KGs. Besides, they often ignore the semantics of\nrelations which is crucial for uncovering logical connections. Recently, large\nlanguage models (LLMs) have shown impressive performance in the field of\nnatural language processing and various applications, owing to their emergent\nability and generalizability. In this paper, we propose a novel framework,\nChatRule, unleashing the power of large language models for mining logical\nrules over knowledge graphs. Specifically, the framework is initiated with an\nLLM-based rule generator, leveraging both the semantic and structural\ninformation of KGs to prompt LLMs to generate logical rules. To refine the\ngenerated rules, a rule ranking module estimates the rule quality by\nincorporating facts from existing KGs. Last, the ranked rules can be used to\nconduct reasoning over KGs. ChatRule is evaluated on four large-scale KGs,\nw.r.t. different rule quality metrics and downstream tasks, showing the\neffectiveness and scalability of our method.\n","authors":["Linhao Luo","Jiaxin Ju","Bo Xiong","Yuan-Fang Li","Gholamreza Haffari","Shirui Pan"],"pdf_url":"https://arxiv.org/pdf/2309.01538v3.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2401.11645v1","updated":"2024-01-22T01:44:42Z","published":"2024-01-22T01:44:42Z","title":"Streaming Bilingual End-to-End ASR model using Attention over Multiple\n Softmax","summary":" Even with several advancements in multilingual modeling, it is challenging to\nrecognize multiple languages using a single neural model, without knowing the\ninput language and most multilingual models assume the availability of the\ninput language. In this work, we propose a novel bilingual end-to-end (E2E)\nmodeling approach, where a single neural model can recognize both languages and\nalso support switching between the languages, without any language input from\nthe user. The proposed model has shared encoder and prediction networks, with\nlanguage-specific joint networks that are combined via a self-attention\nmechanism. 
As the language-specific posteriors are combined, it produces a\nsingle posterior probability over all the output symbols, enabling a single\nbeam search decoding and also allowing dynamic switching between the languages.\nThe proposed approach outperforms the conventional bilingual baseline with\n13.3%, 8.23% and 1.3% word error rate relative reduction on Hindi, English and\ncode-mixed test sets, respectively.\n","authors":["Aditya Patil","Vikas Joshi","Purvi Agrawal","Rupesh Mehta"],"pdf_url":"https://arxiv.org/pdf/2401.11645v1.pdf","comment":"Published in IEEE's Spoken Language Technology (SLT) 2022, 8 pages (6\n + 2 for references), 5 figures"},{"id":"http://arxiv.org/abs/2109.01636v4","updated":"2024-01-22T01:23:23Z","published":"2021-09-03T17:28:04Z","title":"Empirical Study of Named Entity Recognition Performance Using\n Distribution-aware Word Embedding","summary":" With the fast development of Deep Learning techniques, Named Entity\nRecognition (NER) is becoming more and more important in the information\nextraction task. The greatest difficulty that the NER task faces is to keep the\ndetectability even when types of NE and documents are unfamiliar. Realizing\nthat the specificity information may contain potential meanings of a word and\ngenerate semantic-related features for word embedding, we develop a\ndistribution-aware word embedding and implement three different methods to make\nuse of the distribution information in a NER framework. And the result shows\nthat the performance of NER will be improved if the word specificity is\nincorporated into existing NER methods.\n","authors":["Xin Chen","Qi Zhao","Xinyang Liu"],"pdf_url":"https://arxiv.org/pdf/2109.01636v4.pdf","comment":"Want to correct"},{"id":"http://arxiv.org/abs/2401.11641v1","updated":"2024-01-22T01:06:17Z","published":"2024-01-22T01:06:17Z","title":"Revolutionizing Finance with LLMs: An Overview of Applications and\n Insights","summary":" In recent years, Large Language Models (LLMs) like ChatGPT have seen\nconsiderable advancements and have been applied in diverse fields. Built on the\nTransformer architecture, these models are trained on extensive datasets,\nenabling them to understand and generate human language effectively. In the\nfinancial domain, the deployment of LLMs is gaining momentum. These models are\nbeing utilized for automating financial report generation, forecasting market\ntrends, analyzing investor sentiment, and offering personalized financial\nadvice. Leveraging their natural language processing capabilities, LLMs can\ndistill key insights from vast financial data, aiding institutions in making\ninformed investment choices and enhancing both operational efficiency and\ncustomer satisfaction. In this study, we provide a comprehensive overview of\nthe emerging integration of LLMs into various financial tasks. Additionally, we\nconducted holistic tests on multiple financial tasks through the combination of\nnatural language instructions. Our findings show that GPT-4 effectively follow\nprompt instructions across various financial tasks. 
This survey and evaluation\nof LLMs in the financial domain aim to deepen the understanding of LLMs'\ncurrent role in finance for both financial practitioners and LLM researchers,\nidentify new research and application prospects, and highlight how these\ntechnologies can be leveraged to solve practical challenges in the finance\nindustry.\n","authors":["Huaqin Zhao","Zhengliang Liu","Zihao Wu","Yiwei Li","Tianze Yang","Peng Shu","Shaochen Xu","Haixing Dai","Lin Zhao","Gengchen Mai","Ninghao Liu","Tianming Liu"],"pdf_url":"https://arxiv.org/pdf/2401.11641v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.14358v2","updated":"2024-01-22T00:38:08Z","published":"2022-06-29T01:57:44Z","title":"Using Twitter Data to Understand Public Perceptions of Approved versus\n Off-label Use for COVID-19-related Medications","summary":" Understanding public discourse on emergency use of unproven therapeutics is\ncrucial for monitoring safe use and combating misinformation. We developed a\nnatural language processing-based pipeline to comprehend public perceptions of\nand stances on coronavirus disease 2019 (COVID-19)-related drugs on Twitter\nover time. This retrospective study included 609,189 US-based tweets from\nJanuary 29, 2020, to November 30, 2021, about four drugs that garnered\nsignificant public attention during the COVID-19 pandemic: (1)\nHydroxychloroquine and Ivermectin, therapies with anecdotal evidence; and (2)\nMolnupiravir and Remdesivir, FDA-approved treatments for eligible patients.\nTime-trend analysis was employed to understand popularity trends and related\nevents. Content and demographic analyses were conducted to explore potential\nrationales behind people's stances on each drug. Time-trend analysis indicated\nthat Hydroxychloroquine and Ivermectin were discussed more than Molnupiravir\nand Remdesivir, particularly during COVID-19 surges. Hydroxychloroquine and\nIvermectin discussions were highly politicized, related to conspiracy theories,\nhearsay, and celebrity influences. The distribution of stances between the two\nmajor US political parties was significantly different (P < .001); Republicans\nwere more likely to support Hydroxychloroquine (55%) and Ivermectin (30%) than\nDemocrats. People with healthcare backgrounds tended to oppose\nHydroxychloroquine (7%) more than the general population, while the general\npopulation was more likely to support Ivermectin (14%). Our study found that\nsocial media users have varying perceptions and stances on off-label versus\nFDA-authorized drug use at different stages of COVID-19. This indicates that\nhealth systems, regulatory agencies, and policymakers should design tailored\nstrategies to monitor and reduce misinformation to promote safe drug use.\n","authors":["Yining Hua","Hang Jiang","Shixu Lin","Jie Yang","Joseph M. Plasek","David W. Bates","Li Zhou"],"pdf_url":"https://arxiv.org/pdf/2206.14358v2.pdf","comment":"Full paper published in JAMIA"},{"id":"http://arxiv.org/abs/2306.16001v2","updated":"2024-01-22T00:27:45Z","published":"2023-06-28T08:20:35Z","title":"Streamlining Social Media Information Extraction for Public Health\n Research with Deep Learning","summary":" Objective: Social media-based public health research is crucial for epidemic\nsurveillance, but most studies identify relevant corpora with keyword matching.\nThis study develops a system to streamline the process of curating colloquial\nmedical dictionaries. 
We demonstrate the pipeline by curating a UMLS-colloquial\nsymptom dictionary from COVID-19-related tweets as proof of concept. Methods:\nCOVID-19-related tweets from February 1, 2020, to April 30, 2022 were used. The\npipeline includes three modules: a named entity recognition module to detect\nsymptoms in tweets; an entity normalization module to aggregate detected\nentities; and a mapping module that iteratively maps entities to Unified\nMedical Language System concepts. A random 500 entity sample were drawn from\nthe final dictionary for accuracy validation. Additionally, we conducted a\nsymptom frequency distribution analysis to compare our dictionary to a\npre-defined lexicon from previous research. Results: We identified 498,480\nunique symptom entity expressions from the tweets. Pre-processing reduces the\nnumber to 18,226. The final dictionary contains 38,175 unique expressions of\nsymptoms that can be mapped to 966 UMLS concepts (accuracy = 95%). Symptom\ndistribution analysis found that our dictionary detects more symptoms and is\neffective at identifying psychiatric disorders like anxiety and depression,\noften missed by pre-defined lexicons. Conclusion: This study advances public\nhealth research by implementing a novel, systematic pipeline for curating\nsymptom lexicons from social media data. The final lexicon's high accuracy,\nvalidated by medical professionals, underscores the potential of this\nmethodology to reliably interpret and categorize vast amounts of unstructured\nsocial media data into actionable medical insights across diverse linguistic\nand regional landscapes.\n","authors":["Yining Hua","Shixu Lin","Minghui Li","Yujie Zhang","Dinah Foer","Siwen Wang","Peilin Zhou","Li Zhou","Jie Yang"],"pdf_url":"https://arxiv.org/pdf/2306.16001v2.pdf","comment":"Updated full paper. Abstract presented at IEEE ICHI 2023 and AMIA\n Annual Symposium 2023"},{"id":"http://arxiv.org/abs/2401.12413v1","updated":"2024-01-22T23:55:00Z","published":"2024-01-22T23:55:00Z","title":"How Far Can 100 Samples Go? Unlocking Overall Zero-Shot Multilingual\n Translation via Tiny Multi-Parallel Data","summary":" Zero-shot translation is an open problem, aiming to translate between\nlanguage pairs unseen during training in Multilingual Machine Translation\n(MMT). A common, albeit resource-consuming, solution is to mine as many\ntranslation directions as possible to add to the parallel corpus. In this\npaper, we show that the zero-shot capability of an English-centric model can be\neasily enhanced by fine-tuning with a very small amount of multi-parallel data.\nFor example, on the EC30 dataset, we show that up to +21.7 ChrF non-English\noverall improvements (870 directions) can be achieved by using only 100\nmulti-parallel samples, meanwhile preserving capability in English-centric\ndirections. We further study the size effect of fine-tuning data and its\ntransfer capabilities. Surprisingly, our empirical analysis shows that\ncomparable overall improvements can be achieved even through fine-tuning in a\nsmall, randomly sampled direction set (10\\%). Also, the resulting non-English\nperformance is quite close to the upper bound (complete translation). 
Due to\nits high efficiency and practicality, we encourage the community 1) to consider\nthe use of the fine-tuning method as a strong baseline for zero-shot\ntranslation and 2) to construct more comprehensive and high-quality\nmulti-parallel data to cover real-world demand.\n","authors":["Di Wu","Shaomu Tan","Yan Meng","David Stap","Christof Monz"],"pdf_url":"https://arxiv.org/pdf/2401.12413v1.pdf","comment":"15 pages, 5 figures"},{"id":"http://arxiv.org/abs/2401.12406v1","updated":"2024-01-22T23:35:09Z","published":"2024-01-22T23:35:09Z","title":"Enhancing In-context Learning via Linear Probe Calibration","summary":" In-context learning (ICL) is a new paradigm for natural language processing\nthat utilizes Generative Pre-trained Transformer (GPT)-like models. This\napproach uses prompts that include in-context demonstrations to generate the\ncorresponding output for a new query input. However, applying ICL in real cases\ndoes not scale with the number of samples, and lacks robustness to different\nprompt templates and demonstration permutations. In this paper, we first show\nthat GPT-like models using ICL result in unreliable predictions based on a new\nmetric based on Shannon entropy. Then, to solve this problem, we propose a new\ntechnique called the Linear Probe Calibration (LinC), a method that calibrates\nthe model's output probabilities, resulting in reliable predictions and\nimproved performance, while requiring only minimal additional samples (as few\nas five labeled data samples). LinC significantly enhances the ICL test\nperformance of GPT models on various benchmark datasets, with an average\nimprovement of up to 21%, and up to a 50% improvement in some cases, and\nsignificantly boosts the performance of PEFT methods, especially in the low\nresource regime. Moreover, LinC achieves lower expected calibration error, and\nis highly robust to varying label proportions, prompt templates, and\ndemonstration permutations. Our code is available at\n\\url{https://github.com/mominabbass/LinC}.\n","authors":["Momin Abbas","Yi Zhou","Parikshit Ram","Nathalie Baracaldo","Horst Samulowitz","Theodoros Salonidis","Tianyi Chen"],"pdf_url":"https://arxiv.org/pdf/2401.12406v1.pdf","comment":"Accepted at AISTATS2024"},{"id":"http://arxiv.org/abs/2309.08007v2","updated":"2024-01-22T23:05:55Z","published":"2023-09-14T19:33:27Z","title":"DiariST: Streaming Speech Translation with Speaker Diarization","summary":" End-to-end speech translation (ST) for conversation recordings involves\nseveral under-explored challenges such as speaker diarization (SD) without\naccurate word time stamps and handling of overlapping speech in a streaming\nfashion. In this work, we propose DiariST, the first streaming ST and SD\nsolution. It is built upon a neural transducer-based streaming ST system and\nintegrates token-level serialized output training and t-vector, which were\noriginally developed for multi-talker speech recognition. Due to the absence of\nevaluation benchmarks in this area, we develop a new evaluation dataset,\nDiariST-AliMeeting, by translating the reference Chinese transcriptions of the\nAliMeeting corpus into English. We also propose new metrics, called\nspeaker-agnostic BLEU and speaker-attributed BLEU, to measure the ST quality\nwhile taking SD accuracy into account. Our system achieves a strong ST and SD\ncapability compared to offline systems based on Whisper, while performing\nstreaming inference for overlapping speech. 
To facilitate the research in this\nnew direction, we release the evaluation data, the offline baseline systems,\nand the evaluation code.\n","authors":["Mu Yang","Naoyuki Kanda","Xiaofei Wang","Junkun Chen","Peidong Wang","Jian Xue","Jinyu Li","Takuya Yoshioka"],"pdf_url":"https://arxiv.org/pdf/2309.08007v2.pdf","comment":"Accepted to ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.12382v1","updated":"2024-01-22T22:16:55Z","published":"2024-01-22T22:16:55Z","title":"Longitudinal Sentiment Classification of Reddit Posts","summary":" We report results of a longitudinal sentiment classification of Reddit posts\nwritten by students of four major Canadian universities. We work with the texts\nof the posts, concentrating on the years 2020-2023. By finely tuning a\nsentiment threshold to a range of [-0.075,0.075], we successfully built\nclassifiers proficient in categorizing post sentiments into positive and\nnegative categories. Noticeably, our sentiment classification results are\nconsistent across the four university data sets.\n","authors":["Fabian Nwaoha","Ziyad Gaffar","Ho Joon Chun","Marina Sokolova"],"pdf_url":"https://arxiv.org/pdf/2401.12382v1.pdf","comment":"11 pages, 10 figures, 4 tables"},{"id":"http://arxiv.org/abs/2310.00737v3","updated":"2024-01-22T22:12:05Z","published":"2023-10-01T17:25:56Z","title":"GenAI Against Humanity: Nefarious Applications of Generative Artificial\n Intelligence and Large Language Models","summary":" Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs)\nare marvels of technology; celebrated for their prowess in natural language\nprocessing and multimodal content generation, they promise a transformative\nfuture. But as with all powerful tools, they come with their shadows. Picture\nliving in a world where deepfakes are indistinguishable from reality, where\nsynthetic identities orchestrate malicious campaigns, and where targeted\nmisinformation or scams are crafted with unparalleled precision. Welcome to the\ndarker side of GenAI applications. This article is not just a journey through\nthe meanders of potential misuse of GenAI and LLMs, but also a call to\nrecognize the urgency of the challenges ahead. As we navigate the seas of\nmisinformation campaigns, malicious content generation, and the eerie creation\nof sophisticated malware, we'll uncover the societal implications that ripple\nthrough the GenAI revolution we are witnessing. From AI-powered botnets on\nsocial media platforms to the unnerving potential of AI to generate fabricated\nidentities, or alibis made of synthetic realities, the stakes have never been\nhigher. The lines between the virtual and the real worlds are blurring, and the\nconsequences of potential GenAI's nefarious applications impact us all. 
This\narticle serves both as a synthesis of rigorous research presented on the risks\nof GenAI and misuse of LLMs and as a thought-provoking vision of the different\ntypes of harmful GenAI applications we might encounter in the near future, and\nsome ways we can prepare for them.\n","authors":["Emilio Ferrara"],"pdf_url":"https://arxiv.org/pdf/2310.00737v3.pdf","comment":"Accepted in: Journal of Computational Social Science"},{"id":"http://arxiv.org/abs/2401.12375v1","updated":"2024-01-22T21:59:00Z","published":"2024-01-22T21:59:00Z","title":"Development of an NLP-driven computer-based test guide for visually\n impaired students","summary":" In recent years, advancements in Natural Language Processing (NLP) techniques\nhave revolutionized the field of accessibility and exclusivity of testing,\nparticularly for visually impaired students (VIS). CBT has shown in years back\nits relevance in terms of administering exams electronically, making the test\nprocess easier, providing quicker and more accurate results, and offering\ngreater flexibility and accessibility for candidates. Yet, its relevance was\nnot felt by the visually impaired students as they cannot access printed\ndocuments. Hence, in this paper, we present an NLP-driven Computer-Based Test\nguide for visually impaired students. It employs a speech technology\npre-trained methods to provide real-time assistance and support to visually\nimpaired students. The system utilizes NLP technologies to convert the\ntext-based questions and the associated options in a machine-readable format.\nSubsequently, the speech technology pre-trained model processes the converted\ntext enabling the VIS to comprehend and analyze the content. Furthermore, we\nvalidated that this pre-trained model is not perverse by testing for accuracy\nusing sample audio datasets labels (A, B, C, D, E, F, G) to compare with the\nvoice recordings obtained from 20 VIS which is been predicted by the system to\nattain values for precision, recall, and F1-scores. These metrics are used to\nassess the performance of the pre-trained model and have indicated that it is\nproficient enough to give its better performance to the evaluated system. The\nmethodology adopted for this system is Object Oriented Analysis and Design\nMethodology (OOADM) where Objects are discussed and built by modeling\nreal-world instances.\n","authors":["Tubo Faustinah Nemieboka","Ikechukwu E. Onyenwe","Doris C. Asogwa"],"pdf_url":"https://arxiv.org/pdf/2401.12375v1.pdf","comment":"10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2305.14259v4","updated":"2024-01-22T20:47:51Z","published":"2023-05-23T17:12:08Z","title":"Learning to Generate Novel Scientific Directions with Contextualized\n Literature-based Discovery","summary":" Literature-Based Discovery (LBD) aims to discover new scientific knowledge by\nmining papers and generating hypotheses. Standard LBD is limited to predicting\npairwise relations between discrete concepts (e.g., drug-disease links), and\nignores critical contexts like experimental settings (e.g., a specific patient\npopulation where a drug is evaluated) and background motivations (e.g., to find\ndrugs without specific side effects). We address these limitations with a novel\nformulation of contextualized-LBD (C-LBD): generating scientific hypotheses in\nnatural language, while grounding them in a context that controls the\nhypothesis search space. We present a modeling framework using retrieval of\n``inspirations'' from past scientific papers. 
Our evaluations reveal that GPT-4\ntends to generate ideas with overall low technical depth and novelty, while our\ninspiration prompting approaches partially mitigate this issue. Our work\nrepresents a first step toward building language models that generate new ideas\nderived from scientific literature.\n","authors":["Qingyun Wang","Doug Downey","Heng Ji","Tom Hope"],"pdf_url":"https://arxiv.org/pdf/2305.14259v4.pdf","comment":"25 pages. Code and resources are available at\n https://github.com/EagleW/CLBD"},{"id":"http://arxiv.org/abs/2401.12343v1","updated":"2024-01-22T20:17:06Z","published":"2024-01-22T20:17:06Z","title":"Subgraph Extraction-based Feedback-guided Iterative Scheduling for HLS","summary":" This paper proposes ISDC, a novel feedback-guided iterative system of\ndifference constraints (SDC) scheduling algorithm for high-level synthesis\n(HLS). ISDC leverages subgraph extraction-based low-level feedback from\ndownstream tools like logic synthesizers to iteratively refine HLS scheduling.\nTechnical innovations include: (1) An enhanced SDC formulation that effectively\nintegrates low-level feedback into the linear-programming (LP) problem; (2) A\nfanout and window-based subgraph extraction mechanism driving the feedback\ncycle; (3) A no-human-in-loop ISDC flow compatible with a wide range of\ndownstream tools and process design kits (PDKs). Evaluation shows that ISDC\nreduces register usage by 28.5% against an industrial-strength open-source HLS\ntool.\n","authors":["Hanchen Ye","David Z. Pan","Chris Leary","Deming Chen","Xiaoqing Xu"],"pdf_url":"https://arxiv.org/pdf/2401.12343v1.pdf","comment":"DATE'24"},{"id":"http://arxiv.org/abs/2401.12326v1","updated":"2024-01-22T19:39:05Z","published":"2024-01-22T19:39:05Z","title":"Fine-tuning Large Language Models for Multigenerator, Multidomain, and\n Multilingual Machine-Generated Text Detection","summary":" SemEval-2024 Task 8 introduces the challenge of identifying machine-generated\ntexts from diverse Large Language Models (LLMs) in various languages and\ndomains. The task comprises three subtasks: binary classification in\nmonolingual and multilingual (Subtask A), multi-class classification (Subtask\nB), and mixed text detection (Subtask C). This paper focuses on Subtasks A & B.\nEach subtask is supported by three datasets for training, development, and\ntesting. To tackle this task, two methods are employed: 1) traditional machine\nlearning (ML) with natural language preprocessing (NLP) for feature extraction,\nand 2) fine-tuning LLMs for text classification. The results show that\ntransformer models, particularly LoRA-RoBERTa, exceed traditional ML methods in\neffectiveness, with majority voting being particularly effective in\nmultilingual contexts for identifying machine-generated texts.\n","authors":["Feng Xiong","Thanet Markchom","Ziwei Zheng","Subin Jung","Varun Ojha","Huizhi Liang"],"pdf_url":"https://arxiv.org/pdf/2401.12326v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.08919v2","updated":"2024-01-22T19:07:07Z","published":"2024-01-17T02:04:59Z","title":"Partial Diacritization: A Context-Contrastive Inference Approach","summary":" Diacritization plays a pivotal role in improving readability and\ndisambiguating the meaning of Arabic texts. Efforts have so far focused on\nmarking every eligible character (Full Diacritization). Comparatively\noverlooked, Partial Diacritization (PD) is the selection of a subset of\ncharacters to be marked to aid comprehension where needed. 
Research has\nindicated that excessive diacritic marks can hinder skilled readers--reducing\nreading speed and accuracy. We conduct a behavioral experiment and show that\npartially marked text is often easier to read than fully marked text, and\nsometimes easier than plain text. In this light, we introduce\nContext-Contrastive Partial Diacritization (CCPD)--a novel approach to PD which\nintegrates seamlessly with existing Arabic diacritization systems. CCPD\nprocesses each word twice, once with context and once without, and diacritizes\nonly the characters with disparities between the two inferences. Further, we\nintroduce novel indicators for measuring partial diacritization quality (SR,\nPDER, HDER, ERE), essential for establishing this as a machine learning task.\nLastly, we introduce TD2, a Transformer-variant of an established model which\noffers a markedly different performance profile on our proposed indicators\ncompared to all other known systems.\n","authors":["Muhammad ElNokrashy","Badr AlKhamissi"],"pdf_url":"https://arxiv.org/pdf/2401.08919v2.pdf","comment":"13 equations, 5 tables, 5 figures"},{"id":"http://arxiv.org/abs/2401.12295v1","updated":"2024-01-22T19:00:11Z","published":"2024-01-22T19:00:11Z","title":"Cheap Learning: Maximising Performance of Language Models for Social\n Data Science Using Minimal Data","summary":" The field of machine learning has recently made significant progress in\nreducing the requirements for labelled training data when building new models.\nThese `cheaper' learning techniques hold significant potential for the social\nsciences, where development of large labelled training datasets is often a\nsignificant practical impediment to the use of machine learning for analytical\ntasks. In this article we review three `cheap' techniques that have developed\nin recent years: weak supervision, transfer learning and prompt engineering.\nFor the latter, we also review the particular case of zero-shot prompting of\nlarge language models. For each technique we provide a guide of how it works\nand demonstrate its application across six different realistic social science\napplications (two different tasks paired with three different dataset makeups).\nWe show good performance for all techniques, and in particular we demonstrate\nhow prompting of large language models can achieve high accuracy at very low\ncost. Our results are accompanied by a code repository to make it easy for\nothers to duplicate our work and use it in their own research. Overall, our\narticle is intended to stimulate further uptake of these techniques in the\nsocial sciences.\n","authors":["Leonardo Castro-Gonzalez","Yi-Ling Chung","Hannak Rose Kirk","John Francis","Angus R. Williams","Pica Johansson","Jonathan Bright"],"pdf_url":"https://arxiv.org/pdf/2401.12295v1.pdf","comment":"39 pages, 10 figures, 6 tables"},{"id":"http://arxiv.org/abs/2401.12292v1","updated":"2024-01-22T19:00:08Z","published":"2024-01-22T19:00:08Z","title":"GRATH: Gradual Self-Truthifying for Large Language Models","summary":" Truthfulness is paramount for large language models (LLMs) as they are\nincreasingly deployed in real-world applications. However, existing LLMs still\nstruggle with generating truthful answers and content, as evidenced by their\nmodest performance on benchmarks like TruthfulQA. To address this issue, we\npropose GRAdual self-truTHifying (GRATH), a novel post-processing method to\nenhance truthfulness of LLMs. 
GRATH utilizes out-of-domain question prompts to\ngenerate corresponding answers and adaptively optimizes the model via direct\npreference optimization (DPO). Note that during this process, GRATH learns\ntruthfulness in a self-supervised manner without requiring annotated answers.\nIn particular, GRATH first generates pairwise truthfulness training data by\nprompting the LLM itself, with each pair containing a question and its correct\nand incorrect answers. The model is then fine-tuned using DPO to learn from the\ndifference between answer pairs. Subsequently, GRATH iteratively refines the\ntruthfulness data and optimizes the model, leading to a gradual improvement in\nmodel truthfulness. Empirically, we evaluate GRATH using different 7B-LLMs and\ncompare with LLMs with similar or even larger sizes on benchmark datasets. Our\nresults show that GRATH effectively improves LLMs' truthfulness without\ncompromising other core capabilities. Notably, GRATH achieves state-of-the-art\nperformance on TruthfulQA, with MC1 accuracy as 54.71% and MC2 accuracy as\n69.10%, which even surpass those on larger-scale models, such as\nLlama2-Chat-70B, by 23.62% and 24.18%, respectively.\n","authors":["Weixin Chen","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2401.12292v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12273v1","updated":"2024-01-22T17:11:37Z","published":"2024-01-22T17:11:37Z","title":"The Ethics of Interaction: Mitigating Security Threats in LLMs","summary":" This paper comprehensively explores the ethical challenges arising from\nsecurity threats to Language Learning Models (LLMs). These intricate digital\nrepositories are increasingly integrated into our daily lives, making them\nprime targets for attacks that can compromise their training data and the\nconfidentiality of their data sources. The paper delves into the nuanced\nethical repercussions of such security threats on society and individual\nprivacy. We scrutinize five major threats: prompt injection, jailbreaking,\nPersonal Identifiable Information (PII) exposure, sexually explicit content,\nand hate based content, going beyond mere identification to assess their\ncritical ethical consequences and the urgency they create for robust defensive\nstrategies. The escalating reliance on LLMs underscores the crucial need for\nensuring these systems operate within the bounds of ethical norms, particularly\nas their misuse can lead to significant societal and individual harm. We\npropose conceptualizing and developing an evaluative tool tailored for LLMs,\nwhich would serve a dual purpose, guiding developers and designers in\npreemptive fortification of backend systems and scrutinizing the ethical\ndimensions of LLM chatbot responses during the testing phase. By comparing LLM\nresponses with those expected from humans in a moral context, we aim to discern\nthe degree to which AI behaviors align with the ethical values held by a\nbroader society. 
Ultimately, this paper not only underscores the ethical\ntroubles presented by LLMs, it also highlights a path toward cultivating trust\nin these systems.\n","authors":["Ashutosh Kumar","Sagarika Singh","Shiv Vignesh Murty","Swathy Ragupathy"],"pdf_url":"https://arxiv.org/pdf/2401.12273v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2401.12217v1","updated":"2024-01-22T18:59:29Z","published":"2024-01-22T18:59:29Z","title":"Exploring Simple Open-Vocabulary Semantic Segmentation","summary":" Open-vocabulary semantic segmentation models aim to accurately assign a\nsemantic label to each pixel in an image from a set of arbitrary\nopen-vocabulary texts. In order to learn such pixel-level alignment, current\napproaches typically rely on a combination of (i) image-level VL model (e.g.\nCLIP), (ii) ground truth masks, and (iii) custom grouping encoders. In this\npaper, we introduce S-Seg, a novel model that can achieve surprisingly strong\nperformance without depending on any of the above elements. S-Seg leverages\npseudo-mask and language to train a MaskFormer, and can be easily trained from\npublicly available image-text datasets. Contrary to prior works, our model\ndirectly trains for pixel-level features and language alignment. Once trained,\nS-Seg generalizes well to multiple testing datasets without requiring\nfine-tuning. In addition, S-Seg has the extra benefits of scalability with data\nand consistently improvement when augmented with self-training. We believe that\nour simple yet effective approach will serve as a solid baseline for future\nresearch.\n","authors":["Zihang Lai"],"pdf_url":"https://arxiv.org/pdf/2401.12217v1.pdf","comment":"Code is available at: https://github.com/zlai0/S-Seg"},{"id":"http://arxiv.org/abs/2401.12215v1","updated":"2024-01-22T18:59:07Z","published":"2024-01-22T18:59:07Z","title":"Less Could Be Better: Parameter-efficient Fine-tuning Advances Medical\n Vision Foundation Models","summary":" Parameter-efficient fine-tuning (PEFT) that was initially developed for\nexploiting pre-trained large language models has recently emerged as an\neffective approach to perform transfer learning on computer vision tasks.\nHowever, the effectiveness of PEFT on medical vision foundation models is still\nunclear and remains to be explored. As a proof of concept, we conducted a\ndetailed empirical study on applying PEFT to chest radiography foundation\nmodels. Specifically, we delved into LoRA, a representative PEFT method, and\ncompared it against full-parameter fine-tuning (FFT) on two self-supervised\nradiography foundation models across three well-established chest radiograph\ndatasets. Our results showed that LoRA outperformed FFT in 13 out of 18\ntransfer learning tasks by at most 2.9% using fewer than 1% tunable parameters.\nCombining LoRA with foundation models, we set up new state-of-the-art on a\nrange of data-efficient learning tasks, such as an AUROC score of 80.6% using\n1% labeled data on NIH ChestX-ray14. We hope this study can evoke more\nattention from the community in the use of PEFT for transfer learning on\nmedical imaging tasks. 
Code and models are available at\nhttps://github.com/RL4M/MED-PEFT.\n","authors":["Chenyu Lian","Hong-Yu Zhou","Yizhou Yu","Liansheng Wang"],"pdf_url":"https://arxiv.org/pdf/2401.12215v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2310.00647v2","updated":"2024-01-22T18:53:48Z","published":"2023-10-01T12:02:59Z","title":"Beyond Task Performance: Evaluating and Reducing the Flaws of Large\n Multimodal Models with In-Context Learning","summary":" Following the success of Large Language Models (LLMs), Large Multimodal\nModels (LMMs), such as the Flamingo model and its subsequent competitors, have\nstarted to emerge as natural steps towards generalist agents. However,\ninteracting with recent LMMs reveals major limitations that are hardly captured\nby the current evaluation benchmarks. Indeed, task performances (e.g., VQA\naccuracy) alone do not provide enough clues to understand their real\ncapabilities, limitations, and to which extent such models are aligned to human\nexpectations. To refine our understanding of those flaws, we deviate from the\ncurrent evaluation paradigm, and (1) evaluate 10 recent open-source LMMs from\n3B up to 80B parameter scale, on 5 different axes; hallucinations, abstention,\ncompositionality, explainability and instruction following. Our evaluation on\nthese axes reveals major flaws in LMMs. While the current go-to solution to\nalign these models is based on training, such as instruction tuning or RLHF, we\nrather (2) explore the training-free in-context learning (ICL) as a solution,\nand study how it affects these limitations. Based on our ICL study, (3) we push\nICL further and propose new multimodal ICL variants such as; Multitask-ICL,\nChain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows.\n(1) Despite their success, LMMs have flaws that remain unsolved with scaling\nalone. (2) The effect of ICL on LMMs flaws is nuanced; despite its\neffectiveness for improved explainability, answer abstention, ICL only slightly\nimproves instruction following, does not improve compositional abilities, and\nactually even amplifies hallucinations. (3) The proposed ICL variants are\npromising as post-hoc approaches to efficiently tackle some of those flaws. The\ncode is available here: https://github.com/mshukor/EvALign-ICL.\n","authors":["Mustafa Shukor","Alexandre Rame","Corentin Dancette","Matthieu Cord"],"pdf_url":"https://arxiv.org/pdf/2310.00647v2.pdf","comment":"ICLR 2024. Project Page: https://evalign-icl.github.io/"},{"id":"http://arxiv.org/abs/2401.12210v1","updated":"2024-01-22T18:52:51Z","published":"2024-01-22T18:52:51Z","title":"Connecting the Dots: Leveraging Spatio-Temporal Graph Neural Networks\n for Accurate Bangla Sign Language Recognition","summary":" Recent advances in Deep Learning and Computer Vision have been successfully\nleveraged to serve marginalized communities in various contexts. One such area\nis Sign Language - a primary means of communication for the deaf community.\nHowever, so far, the bulk of research efforts and investments have gone into\nAmerican Sign Language, and research activity into low-resource sign languages\n- especially Bangla Sign Language - has lagged significantly. In this research\npaper, we present a new word-level Bangla Sign Language dataset - BdSL40 -\nconsisting of 611 videos over 40 words, along with two different approaches:\none with a 3D Convolutional Neural Network model and another with a novel Graph\nNeural Network approach for the classification of BdSL40 dataset. 
This is the\nfirst study on word-level BdSL recognition, and the dataset was transcribed\nfrom Indian Sign Language (ISL) using the Bangla Sign Language Dictionary\n(1997). The proposed GNN model achieved an F1 score of 89%. The study\nhighlights the significant lexical and semantic similarity between BdSL, West\nBengal Sign Language, and ISL, and the lack of word-level datasets for BdSL in\nthe literature. We release the dataset and source code to stimulate further\nresearch.\n","authors":["Haz Sameen Shahgir","Khondker Salman Sayeed","Md Toki Tahmid","Tanjeem Azwad Zaman","Md. Zarif Ul Alam"],"pdf_url":"https://arxiv.org/pdf/2401.12210v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12208v1","updated":"2024-01-22T18:51:07Z","published":"2024-01-22T18:51:07Z","title":"CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation","summary":" Chest X-rays (CXRs) are the most frequently performed imaging test in\nclinical practice. Recent advances in the development of vision-language\nfoundation models (FMs) give rise to the possibility of performing automated\nCXR interpretation, which can assist physicians with clinical decision-making\nand improve patient outcomes. However, developing FMs that can accurately\ninterpret CXRs is challenging due to the (1) limited availability of\nlarge-scale vision-language datasets in the medical image domain, (2) lack of\nvision and language encoders that can capture the complexities of medical data,\nand (3) absence of evaluation frameworks for benchmarking the abilities of FMs\non CXR interpretation. In this work, we address these challenges by first\nintroducing \\emph{CheXinstruct} - a large-scale instruction-tuning dataset\ncurated from 28 publicly-available datasets. We then present \\emph{CheXagent} -\nan instruction-tuned FM capable of analyzing and summarizing CXRs. To build\nCheXagent, we design a clinical large language model (LLM) for parsing\nradiology reports, a vision encoder for representing CXR images, and a network\nto bridge the vision and language modalities. Finally, we introduce\n\\emph{CheXbench} - a novel benchmark designed to systematically evaluate FMs\nacross 8 clinically-relevant CXR interpretation tasks. Extensive quantitative\nevaluations and qualitative reviews with five expert radiologists demonstrate\nthat CheXagent outperforms previously-developed general- and medical-domain FMs\non CheXbench tasks. Furthermore, in an effort to improve model transparency, we\nperform a fairness evaluation across factors of sex, race and age to highlight\npotential performance disparities. Our project is at\n\\url{https://stanford-aimi.github.io/chexagent.html}.\n","authors":["Zhihong Chen","Maya Varma","Jean-Benoit Delbrouck","Magdalini Paschali","Louis Blankemeier","Dave Van Veen","Jeya Maria Jose Valanarasu","Alaa Youssef","Joseph Paul Cohen","Eduardo Pontes Reis","Emily B. Tsai","Andrew Johnston","Cameron Olsen","Tanishq Mathew Abraham","Sergios Gatidis","Akshay S. Chaudhari","Curtis Langlotz"],"pdf_url":"https://arxiv.org/pdf/2401.12208v1.pdf","comment":"24 pages, 8 figures"},{"id":"http://arxiv.org/abs/2401.12202v1","updated":"2024-01-22T18:42:20Z","published":"2024-01-22T18:42:20Z","title":"OK-Robot: What Really Matters in Integrating Open-Knowledge Models for\n Robotics","summary":" Remarkable progress has been made in recent years in the fields of vision,\nlanguage, and robotics. 
We now have vision models capable of recognizing\nobjects based on language queries, navigation systems that can effectively\ncontrol mobile systems, and grasping models that can handle a wide range of\nobjects. Despite these advancements, general-purpose applications of robotics\nstill lag behind, even though they rely on these fundamental capabilities of\nrecognition, navigation, and grasping. In this paper, we adopt a systems-first\napproach to develop a new Open Knowledge-based robotics framework called\nOK-Robot. By combining Vision-Language Models (VLMs) for object detection,\nnavigation primitives for movement, and grasping primitives for object\nmanipulation, OK-Robot offers an integrated solution for pick-and-drop\noperations without requiring any training. To evaluate its performance, we run\nOK-Robot in 10 real-world home environments. The results demonstrate that\nOK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks,\nrepresenting a new state-of-the-art in Open Vocabulary Mobile Manipulation\n(OVMM) with nearly 1.8x the performance of prior work. In cleaner, uncluttered\nenvironments, OK-Robot's performance increases to 82%. However, the most\nimportant insight gained from OK-Robot is the critical role of nuanced details\nwhen combining Open Knowledge systems like VLMs with robotic modules. Videos of\nour experiments are available on our website: https://ok-robot.github.io\n","authors":["Peiqi Liu","Yaswanth Orru","Chris Paxton","Nur Muhammad Mahi Shafiullah","Lerrel Pinto"],"pdf_url":"https://arxiv.org/pdf/2401.12202v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12198v1","updated":"2024-01-22T18:38:44Z","published":"2024-01-22T18:38:44Z","title":"LONEStar: The Lunar Flashlight Optical Navigation Experiment","summary":" This paper documents the results from the highly successful Lunar Flashlight\nOptical Navigation Experiment with a Star tracker (LONEStar). Launched in\nDecember 2022, Lunar Flashlight (LF) was a NASA-funded technology demonstration\nmission. After a propulsion system anomaly prevented capture in lunar orbit, LF\nwas ejected from the Earth-Moon system and into heliocentric space. NASA\nsubsequently transferred ownership of LF to Georgia Tech to conduct an unfunded\nextended mission to demonstrate further advanced technology objectives,\nincluding LONEStar. From August-December 2023, the LONEStar team performed\non-orbit calibration of the optical instrument and a number of different OPNAV\nexperiments. This campaign included the processing of nearly 400 images of star\nfields, Earth and Moon, and four other planets (Mercury, Mars, Jupiter, and\nSaturn). LONEStar provided the first on-orbit demonstrations of heliocentric\nnavigation using only optical observations of planets. Of special note is the\nsuccessful in-flight demonstration of (1) instantaneous triangulation with\nsimultaneous sightings of two planets with the LOST algorithm and (2) dynamic\ntriangulation with sequential sightings of multiple planets.\n","authors":["Michael Krause","Ava Thrasher","Priyal Soni","Liam Smego","Reuben Isaac","Jennifer Nolan","Micah Pledger","E. Glenn Lightsey","W. 
Jud Ready","John Christian"],"pdf_url":"https://arxiv.org/pdf/2401.12198v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12176v1","updated":"2024-01-22T18:09:15Z","published":"2024-01-22T18:09:15Z","title":"Broiler-Net: A Deep Convolutional Framework for Broiler Behavior\n Analysis in Poultry Houses","summary":" Detecting anomalies in poultry houses is crucial for maintaining optimal\nchicken health conditions, minimizing economic losses and bolstering\nprofitability. This paper presents a novel real-time framework for analyzing\nchicken behavior in cage-free poultry houses to detect abnormal behaviors.\nSpecifically, two significant abnormalities, namely inactive broiler and\nhuddling behavior, are investigated in this study. The proposed framework\ncomprises three key steps: (1) chicken detection utilizing a state-of-the-art\ndeep learning model, (2) tracking individual chickens across consecutive frames\nwith a fast tracker module, and (3) detecting abnormal behaviors within the\nvideo stream. Experimental studies are conducted to evaluate the efficacy of\nthe proposed algorithm in accurately assessing chicken behavior. The results\nillustrate that our framework provides a precise and efficient solution for\nreal-time anomaly detection, facilitating timely interventions to maintain\nchicken health and enhance overall productivity on poultry farms. Github:\nhttps://github.com/TaherehZarratEhsan/Chicken-Behavior-Analysis\n","authors":["Tahereh Zarrat Ehsan","Seyed Mehdi Mohtavipour"],"pdf_url":"https://arxiv.org/pdf/2401.12176v1.pdf","comment":"11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.05916v3","updated":"2024-01-22T18:08:52Z","published":"2023-10-09T17:59:04Z","title":"Interpreting CLIP's Image Representation via Text-Based Decomposition","summary":" We investigate the CLIP image encoder by analyzing how individual model\ncomponents affect the final representation. We decompose the image\nrepresentation as a sum across individual image patches, model layers, and\nattention heads, and use CLIP's text representation to interpret the summands.\nInterpreting the attention heads, we characterize each head's role by\nautomatically finding text representations that span its output space, which\nreveals property-specific roles for many heads (e.g. location or shape). Next,\ninterpreting the image patches, we uncover an emergent spatial localization\nwithin CLIP. Finally, we use this understanding to remove spurious features\nfrom CLIP and to create a strong zero-shot image segmenter. Our results\nindicate that a scalable understanding of transformer models is attainable and\ncan be used to repair and improve models.\n","authors":["Yossi Gandelsman","Alexei A. Efros","Jacob Steinhardt"],"pdf_url":"https://arxiv.org/pdf/2310.05916v3.pdf","comment":"Project page and code:\n https://yossigandelsman.github.io/clip_decomposition/"},{"id":"http://arxiv.org/abs/2401.12175v1","updated":"2024-01-22T18:08:22Z","published":"2024-01-22T18:08:22Z","title":"Single-View 3D Human Digitalization with Large Reconstruction Models","summary":" In this paper, we introduce Human-LRM, a single-stage feed-forward Large\nReconstruction Model designed to predict human Neural Radiance Fields (NeRF)\nfrom a single image. 
Our approach demonstrates remarkable adaptability in\ntraining using extensive datasets containing 3D scans and multi-view capture.\nFurthermore, to enhance the model's applicability for in-the-wild scenarios\nespecially with occlusions, we propose a novel strategy that distills\nmulti-view reconstruction into single-view via a conditional triplane diffusion\nmodel. This generative extension addresses the inherent variations in human\nbody shapes when observed from a single view, and makes it possible to\nreconstruct the full body human from an occluded image. Through extensive\nexperiments, we show that Human-LRM surpasses previous methods by a significant\nmargin on several benchmarks.\n","authors":["Zhenzhen Weng","Jingyuan Liu","Hao Tan","Zhan Xu","Yang Zhou","Serena Yeung-Levy","Jimei Yang"],"pdf_url":"https://arxiv.org/pdf/2401.12175v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12168v1","updated":"2024-01-22T18:01:01Z","published":"2024-01-22T18:01:01Z","title":"SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning\n Capabilities","summary":" Understanding and reasoning about spatial relationships is a fundamental\ncapability for Visual Question Answering (VQA) and robotics. While Vision\nLanguage Models (VLM) have demonstrated remarkable performance in certain VQA\nbenchmarks, they still lack capabilities in 3D spatial reasoning, such as\nrecognizing quantitative relationships of physical objects like distances or\nsize differences. We hypothesize that VLMs' limited spatial reasoning\ncapability is due to the lack of 3D spatial knowledge in training data and aim\nto solve this problem by training VLMs with Internet-scale spatial reasoning\ndata. To this end, we present a system to facilitate this approach. We first\ndevelop an automatic 3D spatial VQA data generation framework that scales up to\n2 billion VQA examples on 10 million real-world images. We then investigate\nvarious factors in the training recipe, including data quality, training\npipeline, and VLM architecture. Our work features the first internet-scale 3D\nspatial reasoning dataset in metric space. By training a VLM on such data, we\nsignificantly enhance its ability on both qualitative and quantitative spatial\nVQA. Finally, we demonstrate that this VLM unlocks novel downstream\napplications in chain-of-thought spatial reasoning and robotics due to its\nquantitative estimation capability. Project website:\nhttps://spatial-vlm.github.io/\n","authors":["Boyuan Chen","Zhuo Xu","Sean Kirmani","Brian Ichter","Danny Driess","Pete Florence","Dorsa Sadigh","Leonidas Guibas","Fei Xia"],"pdf_url":"https://arxiv.org/pdf/2401.12168v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12164v1","updated":"2024-01-22T17:56:07Z","published":"2024-01-22T17:56:07Z","title":"Semi-supervised segmentation of land cover images using nonlinear\n canonical correlation analysis with multiple features and t-SNE","summary":" Image segmentation is a clustering task whereby each pixel is assigned a\ncluster label. Remote sensing data usually consists of multiple bands of\nspectral images in which there exist semantically meaningful land cover\nsubregions, co-registered with other source data such as LIDAR (LIght Detection\nAnd Ranging) data, where available. This suggests that, in order to account for\nspatial correlation between pixels, a feature vector associated with each pixel\nmay be a vectorized tensor representing the multiple bands and a local patch as\nappropriate. 
Similarly, multiple types of texture features based on a pixel's\nlocal patch would also be beneficial for encoding locally statistical\ninformation and spatial variations, without necessarily labelling pixel-wise a\nlarge amount of ground truth, then training a supervised model, which is\nsometimes impractical. In this work, by resorting to label only a small\nquantity of pixels, a new semi-supervised segmentation approach is proposed.\nInitially, over all pixels, an image data matrix is created in high dimensional\nfeature space. Then, t-SNE projects the high dimensional data onto 3D\nembedding. By using radial basis functions as input features, which use the\nlabelled data samples as centres, to pair with the output class labels, a\nmodified canonical correlation analysis algorithm, referred to as RBF-CCA, is\nintroduced which learns the associated projection matrix via the small labelled\ndata set. The associated canonical variables, obtained for the full image, are\napplied by k-means clustering algorithm. The proposed semi-supervised RBF-CCA\nalgorithm has been implemented on several remotely sensed multispectral images,\ndemonstrating excellent segmentation results.\n","authors":["Hong Wei","James Xiao","Yichao Zhang","Xia Hong"],"pdf_url":"https://arxiv.org/pdf/2401.12164v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12161v1","updated":"2024-01-22T17:55:16Z","published":"2024-01-22T17:55:16Z","title":"Automated facial recognition system using deep learning for pain\n assessment in adults with cerebral palsy","summary":" Background: Pain assessment in individuals with neurological conditions,\nespecially those with limited self-report ability and altered facial\nexpressions, presents challenges. Existing measures, relying on direct\nobservation by caregivers, lack sensitivity and specificity. In cerebral palsy,\npain is a common comorbidity and a reliable evaluation protocol is crucial.\nThus, having an automatic system that recognizes facial expressions could be of\nenormous help when diagnosing pain in this type of patient.\n Objectives: 1) to build a dataset of facial pain expressions in individuals\nwith cerebral palsy, and 2) to develop an automated facial recognition system\nbased on deep learning for pain assessment addressed to this population.\n Methods: Ten neural networks were trained on three pain image databases,\nincluding the UNBC-McMaster Shoulder Pain Expression Archive Database, the\nMultimodal Intensity Pain Dataset, and the Delaware Pain Database.\nAdditionally, a curated dataset (CPPAIN) was created, consisting of 109\npreprocessed facial pain expression images from individuals with cerebral\npalsy, categorized by two physiotherapists using the Facial Action Coding\nSystem observational scale.\n Results: InceptionV3 exhibited promising performance on the CP-PAIN dataset,\nachieving an accuracy of 62.67% and an F1 score of 61.12%. Explainable\nartificial intelligence techniques revealed consistent essential features for\npain identification across models.\n Conclusion: This study demonstrates the potential of deep learning models for\nrobust pain detection in populations with neurological conditions and\ncommunication disabilities. The creation of a larger dataset specific to\ncerebral palsy would further enhance model accuracy, offering a valuable tool\nfor discerning subtle and idiosyncratic pain expressions. The insights gained\ncould extend to other complex neurological conditions.\n","authors":["Álvaro Sabater-Gárriz","F. 
Xavier Gaya-Morey","José María Buades-Rubio","Cristina Manresa Yee","Pedro Montoya","Inmaculada Riquelme"],"pdf_url":"https://arxiv.org/pdf/2401.12161v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.08573v2","updated":"2024-01-22T17:54:58Z","published":"2024-01-16T18:58:36Z","title":"Benchmarking the Robustness of Image Watermarks","summary":" This paper investigates the weaknesses of image watermarking techniques. We\npresent WAVES (Watermark Analysis Via Enhanced Stress-testing), a novel\nbenchmark for assessing watermark robustness, overcoming the limitations of\ncurrent evaluation methods. WAVES integrates detection and identification tasks,\nand establishes a standardized evaluation protocol comprised of a diverse range\nof stress tests. The attacks in WAVES range from traditional image distortions\nto advanced and novel variations of diffusive and adversarial attacks. Our\nevaluation examines two pivotal dimensions: the degree of image quality\ndegradation and the efficacy of watermark detection after attacks. We develop a\nseries of Performance vs. Quality 2D plots, varying over several prominent\nimage similarity metrics, which are then aggregated in a heuristically novel\nmanner to paint an overall picture of watermark robustness and attack potency.\nOur comprehensive evaluation reveals previously undetected vulnerabilities of\nseveral modern watermarking algorithms. We envision WAVES as a toolkit for the\nfuture development of robust watermarking systems. The project is available at\nhttps://wavesbench.github.io/\n","authors":["Bang An","Mucong Ding","Tahseen Rabbani","Aakriti Agrawal","Yuancheng Xu","Chenghao Deng","Sicheng Zhu","Abdirisak Mohamed","Yuxin Wen","Tom Goldstein","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2401.08573v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.02273v4","updated":"2024-01-22T17:37:03Z","published":"2023-07-05T13:17:14Z","title":"Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient\n Neural Image Compression","summary":" Recently, the performance of neural image compression (NIC) has steadily\nimproved thanks to the latest line of study, reaching or outperforming\nstate-of-the-art conventional codecs. Despite significant progress, current NIC\nmethods still rely on ConvNet-based entropy coding, limited in modeling\nlong-range dependencies due to their local connectivity and the increasing\nnumber of architectural biases and priors, resulting in complex underperforming\nmodels with high decoding latency. Motivated by the efficiency investigation of\nthe Transformer-based transform coding framework, namely SwinT-ChARM, we propose\nto enhance the latter, first, with a more straightforward yet effective\nTransformer-based channel-wise auto-regressive prior model, resulting in an\nabsolute image compression transformer (ICT). Through the proposed ICT, we can\ncapture both global and local contexts from the latent representations and\nbetter parameterize the distribution of the quantized latents. Further, we\nleverage a learnable scaling module with a sandwich ConvNeXt-based\npre-/post-processor to accurately extract more compact latent codes while\nreconstructing higher-quality images. Extensive experimental results on\nbenchmark datasets showed that the proposed framework significantly improves\nthe trade-off between coding efficiency and decoder complexity over the\nversatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec\nSwinT-ChARM. 
Moreover, we provide model scaling studies to verify the\ncomputational efficiency of our approach and conduct several objective and\nsubjective analyses to bring to the fore the performance gap between the\nadaptive image compression transformer (AICT) and the neural codec SwinT-ChARM.\n","authors":["Ahmed Ghorbel","Wassim Hamidouche","Luce Morin"],"pdf_url":"https://arxiv.org/pdf/2307.02273v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12133v1","updated":"2024-01-22T17:15:02Z","published":"2024-01-22T17:15:02Z","title":"VRMN-bD: A Multi-modal Natural Behavior Dataset of Immersive Human Fear\n Responses in VR Stand-up Interactive Games","summary":" Understanding and recognizing emotions are important and challenging issues\nin the metaverse era. Understanding, identifying, and predicting fear, which is\none of the fundamental human emotions, in virtual reality (VR) environments\nplays an essential role in immersive game development, scene development, and\nnext-generation virtual human-computer interaction applications. In this\narticle, we used VR horror games as a medium to analyze fear emotions by\ncollecting multi-modal data (posture, audio, and physiological signals) from 23\nplayers. We used an LSTM-based model to predict fear with accuracies of 65.31%\nand 90.47% under 6-level classification (no fear and five different levels of\nfear) and 2-level classification (no fear and fear), respectively. We\nconstructed a multi-modal natural behavior dataset of immersive human fear\nresponses (VRMN-bD) and compared it with existing relevant advanced datasets.\nThe results show that our dataset has fewer limitations in terms of collection\nmethod, data scale and audience scope. We are unique and advanced in targeting\nmulti-modal datasets of fear and behavior in VR stand-up interactive\nenvironments. Moreover, we discussed the implications of this work for\ncommunities and applications. The dataset and pre-trained model are available\nat https://github.com/KindOPSTAR/VRMN-bD.\n","authors":["He Zhang","Xinyang Li","Yuanxi Sun","Xinyi Fu","Christine Qiu","John M. Carroll"],"pdf_url":"https://arxiv.org/pdf/2401.12133v1.pdf","comment":"Accepted to IEEE VR 2024"},{"id":"http://arxiv.org/abs/2401.06144v2","updated":"2024-01-22T17:11:57Z","published":"2023-11-30T23:31:33Z","title":"DFU: scale-robust diffusion model for zero-shot super-resolution image\n generation","summary":" Diffusion generative models have achieved remarkable success in generating\nimages with a fixed resolution. However, existing models have limited ability\nto generalize to different resolutions when training data at those resolutions\nare not available. Leveraging techniques from operator learning, we present a\nnovel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the\nscore operator by combining both spatial and spectral information at multiple\nresolutions. Comparisons of DFU to baselines demonstrate its scalability: 1)\nsimultaneously training on multiple resolutions improves FID over training at\nany single fixed resolution; 2) DFU generalizes beyond its training\nresolutions, allowing for coherent, high-fidelity generation at\nhigher-resolutions with the same model, i.e. 
zero-shot super-resolution\nimage-generation; 3) we propose a fine-tuning strategy to further enhance the\nzero-shot super-resolution image-generation capability of our model, leading to\na FID of 11.3 at 1.66 times the maximum training resolution on FFHQ, which no\nother method can come close to achieving.\n","authors":["Alex Havrilla","Kevin Rojas","Wenjing Liao","Molei Tao"],"pdf_url":"https://arxiv.org/pdf/2401.06144v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12129v1","updated":"2024-01-22T17:11:01Z","published":"2024-01-22T17:11:01Z","title":"Out-of-Distribution Detection & Applications With Ablated Learned\n Temperature Energy","summary":" As deep neural networks become adopted in high-stakes domains, it is crucial\nto be able to identify when inference inputs are Out-of-Distribution (OOD) so\nthat users can be alerted of likely drops in performance and calibration\ndespite high confidence. Among many others, existing methods use the following\ntwo scores to do so without training on any apriori OOD examples: a learned\ntemperature and an energy score. In this paper we introduce Ablated Learned\nTemperature Energy (or \"AbeT\" for short), a method which combines these prior\nmethods in novel ways with effective modifications. Due to these contributions,\nAbeT lowers the False Positive Rate at $95\\%$ True Positive Rate (FPR@95) by\n$35.39\\%$ in classification (averaged across all ID and OOD datasets measured)\ncompared to state of the art without training networks in multiple stages or\nrequiring hyperparameters or test-time backward passes. We additionally provide\nempirical insights as to how our model learns to distinguish between\nIn-Distribution (ID) and OOD samples while only being explicitly trained on ID\nsamples via exposure to misclassified ID examples at training time. Lastly, we\nshow the efficacy of our method in identifying predicted bounding boxes and\npixels corresponding to OOD objects in object detection and semantic\nsegmentation, respectively - with an AUROC increase of $5.15\\%$ in object\ndetection and both a decrease in FPR@95 of $41.48\\%$ and an increase in AUPRC\nof $34.20\\%$ on average in semantic segmentation compared to previous state of\nthe art.\n","authors":["Will LeVine","Benjamin Pikus","Jacob Phillips","Berk Norman","Fernando Amat Gil","Sean Hendryx"],"pdf_url":"https://arxiv.org/pdf/2401.12129v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.00454v2","updated":"2024-01-22T17:10:49Z","published":"2023-09-30T18:13:41Z","title":"UniLVSeg: Unified Left Ventricular Segmentation with Sparsely Annotated\n Echocardiogram Videos through Self-Supervised Temporal Masking and Weakly\n Supervised Training","summary":" Echocardiography has become an indispensable clinical imaging modality for\ngeneral heart health assessment. From calculating biomarkers such as ejection\nfraction to the probability of a patient's heart failure, accurate segmentation\nof the heart and its structures allows doctors to plan and execute treatments\nwith greater precision and accuracy. However, achieving accurate and robust\nleft ventricle segmentation is time-consuming and challenging due to different\nreasons. This work introduces a novel approach for consistent left ventricular\n(LV) segmentation from sparsely annotated echocardiogram videos. We achieve\nthis through (1) self-supervised learning (SSL) using temporal masking followed\nby (2) weakly supervised training. 
We investigate two different segmentation\napproaches: 3D segmentation and a novel 2D superimage (SI). We demonstrate how\nour proposed method outperforms the state-of-the-art solutions by achieving a\n93.32% (95%CI 93.21-93.43%) dice score on a large-scale dataset\n(EchoNet-Dynamic) while being more efficient. To show the effectiveness of our\napproach, we provide extensive ablation studies, including pre-training\nsettings and various deep learning backbones. Additionally, we discuss how our\nproposed methodology achieves high data utility by incorporating unlabeled\nframes in the training process. To help support the AI in medicine community,\nthe complete solution with the source code will be made publicly available upon\nacceptance.\n","authors":["Fadillah Maani","Asim Ukaye","Nada Saadi","Numan Saeed","Mohammad Yaqub"],"pdf_url":"https://arxiv.org/pdf/2310.00454v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12074v1","updated":"2024-01-22T16:14:26Z","published":"2024-01-22T16:14:26Z","title":"DeepCERES: A Deep learning method for cerebellar lobule segmentation\n using ultra-high resolution multimodal MRI","summary":" This paper introduces a novel multimodal and high-resolution human brain\ncerebellum lobule segmentation method. Unlike current tools that operate at\nstandard resolution ($1 \\text{ mm}^{3}$) or using mono-modal data, the proposed\nmethod improves cerebellum lobule segmentation through the use of a multimodal\nand ultra-high resolution ($0.125 \\text{ mm}^{3}$) training dataset. To develop\nthe method, first, a database of semi-automatically labelled cerebellum lobules\nwas created to train the proposed method with ultra-high resolution T1 and T2\nMR images. Then, an ensemble of deep networks has been designed and developed,\nallowing the proposed method to excel in the complex cerebellum lobule\nsegmentation task, improving precision while being memory efficient. Notably,\nour approach deviates from the traditional U-Net model by exploring alternative\narchitectures. We have also integrated deep learning with classical machine\nlearning methods incorporating a priori knowledge from multi-atlas\nsegmentation, which improved precision and robustness. Finally, a new online\npipeline, named DeepCERES, has been developed to make available the proposed\nmethod to the scientific community requiring as input only a single T1 MR image\nat standard resolution.\n","authors":["Sergio Morell-Ortega","Marina Ruiz-Perez","Marien Gadea","Roberto Vivo-Hernando","Gregorio Rubio","Fernando Aparici","Mariam de la Iglesia-Vaya","Gwenaelle Catheline","Pierrick Coupé","José V. Manjón"],"pdf_url":"https://arxiv.org/pdf/2401.12074v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2401.12051v1","updated":"2024-01-22T15:42:21Z","published":"2024-01-22T15:42:21Z","title":"CloSe: A 3D Clothing Segmentation Dataset and Model","summary":" 3D Clothing modeling and datasets play crucial role in the entertainment,\nanimation, and digital fashion industries. Existing work often lacks detailed\nsemantic understanding or uses synthetic datasets, lacking realism and\npersonalization. To address this, we first introduce CloSe-D: a novel\nlarge-scale dataset containing 3D clothing segmentation of 3167 scans, covering\na range of 18 distinct clothing classes. Additionally, we propose CloSe-Net,\nthe first learning-based 3D clothing segmentation model for fine-grained\nsegmentation from colored point clouds. 
CloSe-Net uses local point features,\nbody-clothing correlation, and a garment-class and point features-based\nattention module, improving performance over baselines and prior work. The\nproposed attention module enables our model to learn appearance and\ngeometry-dependent clothing prior from data. We further validate the efficacy\nof our approach by successfully segmenting publicly available datasets of\npeople in clothing. We also introduce CloSe-T, a 3D interactive tool for\nrefining segmentation labels. Combining the tool with CloSe-T in a continual\nlearning setup demonstrates improved generalization on real-world data.\nDataset, model, and tool can be found at\nhttps://virtualhumans.mpi-inf.mpg.de/close3dv24/.\n","authors":["Dimitrije Antić","Garvita Tiwari","Batuhan Ozcomlekci","Riccardo Marin","Gerard Pons-Moll"],"pdf_url":"https://arxiv.org/pdf/2401.12051v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12048v1","updated":"2024-01-22T15:40:24Z","published":"2024-01-22T15:40:24Z","title":"HomeRobot Open Vocabulary Mobile Manipulation Challenge 2023 Participant\n Report (Team KuzHum)","summary":" We report an improvements to NeurIPS 2023 HomeRobot: Open Vocabulary Mobile\nManipulation (OVMM) Challenge reinforcement learning baseline. More\nspecifically, we propose more accurate semantic segmentation module, along with\nbetter place skill policy, and high-level heuristic that outperforms the\nbaseline by 2.4% of overall success rate (sevenfold improvement) and 8.2% of\npartial success rate (1.75 times improvement) on Test Standard split of the\nchallenge dataset. With aforementioned enhancements incorporated our agent\nscored 3rd place in the challenge on both simulation and real-world stages.\n","authors":["Volodymyr Kuzma","Vladyslav Humennyy","Ruslan Partsey"],"pdf_url":"https://arxiv.org/pdf/2401.12048v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.08865v2","updated":"2024-01-22T15:30:08Z","published":"2024-01-16T22:36:23Z","title":"The Effect of Intrinsic Dataset Properties on Generalization: Unraveling\n Learning Differences Between Natural and Medical Images","summary":" This paper investigates discrepancies in how neural networks learn from\ndifferent imaging domains, which are commonly overlooked when adopting computer\nvision techniques from the domain of natural images to other specialized\ndomains such as medical images. Recent works have found that the generalization\nerror of a trained network typically increases with the intrinsic dimension\n($d_{data}$) of its training set. Yet, the steepness of this relationship\nvaries significantly between medical (radiological) and natural imaging\ndomains, with no existing theoretical explanation. We address this gap in\nknowledge by establishing and empirically validating a generalization scaling\nlaw with respect to $d_{data}$, and propose that the substantial scaling\ndiscrepancy between the two considered domains may be at least partially\nattributed to the higher intrinsic \"label sharpness\" ($K_F$) of medical imaging\ndatasets, a metric which we propose. Next, we demonstrate an additional benefit\nof measuring the label sharpness of a training set: it is negatively correlated\nwith the trained model's adversarial robustness, which notably leads to models\nfor medical images having a substantially higher vulnerability to adversarial\nattack. 
Finally, we extend our $d_{data}$ formalism to the related metric of\nlearned representation intrinsic dimension ($d_{repr}$), derive a\ngeneralization scaling law with respect to $d_{repr}$, and show that $d_{data}$\nserves as an upper bound for $d_{repr}$. Our theoretical results are supported\nby thorough experiments with six models and eleven natural and medical imaging\ndatasets over a range of training set sizes. Our findings offer insights into\nthe influence of intrinsic dataset properties on generalization, representation\nlearning, and robustness in deep neural networks.\n","authors":["Nicholas Konz","Maciej A. Mazurowski"],"pdf_url":"https://arxiv.org/pdf/2401.08865v2.pdf","comment":"ICLR 2024. Code:\n https://github.com/mazurowski-lab/intrinsic-properties"},{"id":"http://arxiv.org/abs/2401.12039v1","updated":"2024-01-22T15:26:01Z","published":"2024-01-22T15:26:01Z","title":"Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling","summary":" The goal of this paper is automatic character-aware subtitle generation.\nGiven a video and a minimal amount of metadata, we propose an audio-visual\nmethod that generates a full transcript of the dialogue, with precise speech\ntimestamps, and the character speaking identified. The key idea is to first use\naudio-visual cues to select a set of high-precision audio exemplars for each\ncharacter, and then use these exemplars to classify all speech segments by\nspeaker identity. Notably, the method does not require face detection or\ntracking. We evaluate the method over a variety of TV sitcoms, including\nSeinfeld, Fraiser and Scrubs. We envision this system being useful for the\nautomatic generation of subtitles to improve the accessibility of the vast\namount of videos available on modern streaming services. Project page :\n\\url{https://www.robots.ox.ac.uk/~vgg/research/look-listen-recognise/}\n","authors":["Bruno Korbar","Jaesung Huh","Andrew Zisserman"],"pdf_url":"https://arxiv.org/pdf/2401.12039v1.pdf","comment":"Accepted for publication in ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.12033v1","updated":"2024-01-22T15:19:18Z","published":"2024-01-22T15:19:18Z","title":"Momentum-SAM: Sharpness Aware Minimization without Computational\n Overhead","summary":" The recently proposed optimization algorithm for deep neural networks\nSharpness Aware Minimization (SAM) suggests perturbing parameters before\ngradient calculation by a gradient ascent step to guide the optimization into\nparameter space regions of flat loss. While significant generalization\nimprovements and thus reduction of overfitting could be demonstrated, the\ncomputational costs are doubled due to the additionally needed gradient\ncalculation, making SAM unfeasible in case of limited computationally\ncapacities. Motivated by Nesterov Accelerated Gradient (NAG) we propose\nMomentum-SAM (MSAM), which perturbs parameters in the direction of the\naccumulated momentum vector to achieve low sharpness without significant\ncomputational overhead or memory demands over SGD or Adam. We evaluate MSAM in\ndetail and reveal insights on separable mechanisms of NAG, SAM and MSAM\nregarding training optimization and generalization. 
Code is available at\nhttps://github.com/MarlonBecker/MSAM.\n","authors":["Marlon Becker","Frederick Altrock","Benjamin Risse"],"pdf_url":"https://arxiv.org/pdf/2401.12033v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.09495v3","updated":"2024-01-22T15:05:43Z","published":"2024-01-17T01:33:40Z","title":"IPR-NeRF: Ownership Verification meets Neural Radiance Field","summary":" Neural Radiance Field (NeRF) models have gained significant attention in the\ncomputer vision community in the recent past with state-of-the-art visual\nquality and produced impressive demonstrations. Since then, technopreneurs have\nsought to leverage NeRF models into a profitable business. Therefore, NeRF\nmodels make it worth the risk of plagiarizers illegally copying,\nre-distributing, or misusing those models. This paper proposes a comprehensive\nintellectual property (IP) protection framework for the NeRF model in both\nblack-box and white-box settings, namely IPR-NeRF. In the black-box setting, a\ndiffusion-based solution is introduced to embed and extract the watermark via a\ntwo-stage optimization process. In the white-box setting, a designated digital\nsignature is embedded into the weights of the NeRF model by adopting the sign\nloss objective. Our extensive experiments demonstrate that not only does our\napproach maintain the fidelity (\\ie, the rendering quality) of IPR-NeRF models,\nbut it is also robust against both ambiguity and removal attacks compared to\nprior arts.\n","authors":["Win Kent Ong","Kam Woh Ng","Chee Seng Chan","Yi Zhe Song","Tao Xiang"],"pdf_url":"https://arxiv.org/pdf/2401.09495v3.pdf","comment":"Error on result tabulation for the state of the art method which\n might cause misleading to the readers"},{"id":"http://arxiv.org/abs/2401.12019v1","updated":"2024-01-22T15:05:05Z","published":"2024-01-22T15:05:05Z","title":"Stereo-Matching Knowledge Distilled Monocular Depth Estimation Filtered\n by Multiple Disparity Consistency","summary":" In stereo-matching knowledge distillation methods of the self-supervised\nmonocular depth estimation, the stereo-matching network's knowledge is\ndistilled into a monocular depth network through pseudo-depth maps. In these\nmethods, the learning-based stereo-confidence network is generally utilized to\nidentify errors in the pseudo-depth maps to prevent transferring the errors.\nHowever, the learning-based stereo-confidence networks should be trained with\nground truth (GT), which is not feasible in a self-supervised setting. In this\npaper, we propose a method to identify and filter errors in the pseudo-depth\nmap using multiple disparity maps by checking their consistency without the\nneed for GT and a training process. Experimental results show that the proposed\nmethod outperforms the previous methods and works well on various\nconfigurations by filtering out erroneous areas where the stereo-matching is\nvulnerable, especially such as textureless regions, occlusion boundaries, and\nreflective surfaces.\n","authors":["Woonghyun Ka","Jae Young Lee","Jaehyun Choi","Junmo Kim"],"pdf_url":"https://arxiv.org/pdf/2401.12019v1.pdf","comment":"ICASSP 2024. The first two authors are equally contributed"},{"id":"http://arxiv.org/abs/2401.12014v1","updated":"2024-01-22T15:00:32Z","published":"2024-01-22T15:00:32Z","title":"Robustness to distribution shifts of compressed networks for edge\n devices","summary":" It is necessary to develop efficient DNNs deployed on edge devices with\nlimited computation resources. 
However, the compressed networks often execute\nnew tasks in the target domain, which is different from the source domain where\nthe original network is trained. It is important to investigate the robustness\nof compressed networks in two types of data distribution shifts: domain shifts\nand adversarial perturbations. In this study, we discover that compressed\nmodels are less robust to distribution shifts than their original networks.\nInterestingly, larger networks are more vulnerable to losing robustness than\nsmaller ones, even when they are compressed to a similar size as the smaller\nnetworks. Furthermore, compact networks obtained by knowledge distillation are\nmuch more robust to distribution shifts than pruned networks. Finally,\npost-training quantization is a reliable method for achieving significant\nrobustness to distribution shifts, and it outperforms both pruned and distilled\nmodels in terms of robustness.\n","authors":["Lulan Shen","Ali Edalati","Brett Meyer","Warren Gross","James J. Clark"],"pdf_url":"https://arxiv.org/pdf/2401.12014v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11841v4","updated":"2024-01-22T14:59:20Z","published":"2023-12-19T04:14:11Z","title":"MixRT: Mixed Neural Representations For Real-Time NeRF Rendering","summary":" Neural Radiance Field (NeRF) has emerged as a leading technique for novel\nview synthesis, owing to its impressive photorealistic reconstruction and\nrendering capability. Nevertheless, achieving real-time NeRF rendering in\nlarge-scale scenes has presented challenges, often leading to the adoption of\neither intricate baked mesh representations with a substantial number of\ntriangles or resource-intensive ray marching in baked representations. We\nchallenge these conventions, observing that high-quality geometry, represented\nby meshes with substantial triangles, is not necessary for achieving\nphotorealistic rendering quality. Consequently, we propose MixRT, a novel NeRF\nrepresentation that includes a low-quality mesh, a view-dependent displacement\nmap, and a compressed NeRF model. This design effectively harnesses the\ncapabilities of existing graphics hardware, thus enabling real-time NeRF\nrendering on edge devices. Leveraging a highly-optimized WebGL-based rendering\nframework, our proposed MixRT attains real-time rendering speeds on edge\ndevices (over 30 FPS at a resolution of 1280 x 720 on a MacBook M1 Pro laptop),\nbetter rendering quality (0.2 PSNR higher in indoor scenes of the Unbounded-360\ndatasets), and a smaller storage size (less than 80% compared to\nstate-of-the-art methods).\n","authors":["Chaojian Li","Bichen Wu","Peter Vajda"," Yingyan"," Lin"],"pdf_url":"https://arxiv.org/pdf/2312.11841v4.pdf","comment":"Accepted by 3DV'24. Project Page: https://licj15.github.io/MixRT/"},{"id":"http://arxiv.org/abs/2312.10105v2","updated":"2024-01-22T14:56:52Z","published":"2023-12-15T04:11:34Z","title":"Forging Tokens for Improved Storage-efficient Training","summary":" Recent advancements in Deep Neural Network (DNN) models have significantly\nimproved performance across computer vision tasks. However, achieving highly\ngeneralizable and high-performing vision models requires extensive datasets,\nleading to large storage requirements. This storage challenge poses a critical\nbottleneck for scaling up vision models. Motivated by the success of discrete\nrepresentations, SeiT proposes to use Vector-Quantized (VQ) feature vectors\n(i.e., tokens) as network inputs for vision classification. 
However, applying\ntraditional data augmentations to tokens faces challenges due to input domain\nshift. To address this issue, we introduce TokenAdapt and ColorAdapt, simple\nyet effective token-based augmentation strategies. TokenAdapt realigns token\nembedding space for compatibility with spatial augmentations, preserving the\nmodel's efficiency without requiring fine-tuning. Additionally, ColorAdapt\naddresses color-based augmentations for tokens inspired by Adaptive Instance\nNormalization (AdaIN). We evaluate our approach across various scenarios,\nincluding storage-efficient ImageNet-1k classification, fine-grained\nclassification, robustness benchmarks, and ADE-20k semantic segmentation.\nExperimental results demonstrate consistent performance improvement in diverse\nexperiments. Code is available at https://github.com/naver-ai/tokenadapt.\n","authors":["Minhyun Lee","Song Park","Byeongho Heo","Dongyoon Han","Hyunjung Shim"],"pdf_url":"https://arxiv.org/pdf/2312.10105v2.pdf","comment":"First two authors contributed equally"},{"id":"http://arxiv.org/abs/2311.03782v3","updated":"2024-01-22T14:52:14Z","published":"2023-11-07T08:05:09Z","title":"CapST: An Enhanced and Lightweight Model Attribution Approach for\n Synthetic Videos","summary":" Deepfake videos, generated through AI faceswapping techniques, have garnered\nconsiderable attention due to their potential for powerful impersonation\nattacks. While existing research primarily focuses on binary classification to\ndiscern between real and fake videos, determining the specific\ngeneration model for a fake video is crucial for forensic investigation.\nAddressing this gap, this paper investigates the model attribution problem of\nDeepfake videos from a recently proposed dataset, Deepfakes from Different\nModels (DFDM), derived from various Autoencoder models. The dataset comprises\n6,450 Deepfake videos generated by five distinct models with variations in\nencoder, decoder, intermediate layer, input resolution, and compression ratio.\nThis study formulates Deepfakes model attribution as a multiclass\nclassification task, proposing a segment of VGG19 as a feature extraction\nbackbone, known for its effectiveness in image-related tasks, while integrating a\nCapsule Network with a Spatio-Temporal attention mechanism. The Capsule module\ncaptures intricate hierarchies among features for robust identification of\ndeepfake attributes. Additionally, the video-level fusion technique leverages\ntemporal attention mechanisms to handle concatenated feature vectors,\ncapitalizing on inherent temporal dependencies in deepfake videos.
By\naggregating insights across frames, our model gains a comprehensive\nunderstanding of video content, resulting in more precise predictions.\nExperimental results on the deepfake benchmark dataset (DFDM) demonstrate the\nefficacy of our proposed method, achieving up to a 4% improvement in accurately\ncategorizing deepfake videos compared to baseline models while demanding fewer\ncomputational resources.\n","authors":["Wasim Ahmad","Yan-Tsung Peng","Yuan-Hao Chang","Gaddisa Olani Ganfure","Sarwar Khan","Sahibzada Adil Shahzad"],"pdf_url":"https://arxiv.org/pdf/2311.03782v3.pdf","comment":"Rejected from jounal and will have to conduct several more\n experiments"},{"id":"http://arxiv.org/abs/2401.12001v1","updated":"2024-01-22T14:52:08Z","published":"2024-01-22T14:52:08Z","title":"Modeling Stereo-Confidence Out of the End-to-End Stereo-Matching Network\n via Disparity Plane Sweep","summary":" We propose a novel stereo-confidence that can be measured externally to\nvarious stereo-matching networks, offering an alternative input modality choice\nof the cost volume for learning-based approaches, especially in safety-critical\nsystems. Grounded in the foundational concepts of disparity definition and the\ndisparity plane sweep, the proposed stereo-confidence method is built upon the\nidea that any shift in a stereo-image pair should be updated in a corresponding\namount shift in the disparity map. Based on this idea, the proposed\nstereo-confidence method can be summarized in three folds. 1) Using the\ndisparity plane sweep, multiple disparity maps can be obtained and treated as a\n3-D volume (predicted disparity volume), like the cost volume is constructed.\n2) One of these disparity maps serves as an anchor, allowing us to define a\ndesirable (or ideal) disparity profile at every spatial point. 3) By comparing\nthe desirable and predicted disparity profiles, we can quantify the level of\nmatching ambiguity between left and right images for confidence measurement.\nExtensive experimental results using various stereo-matching networks and\ndatasets demonstrate that the proposed stereo-confidence method not only shows\ncompetitive performance on its own but also consistent performance improvements\nwhen it is used as an input modality for learning-based stereo-confidence\nmethods.\n","authors":["Jae Young Lee","Woonghyun Ka","Jaehyun Choi","Junmo Kim"],"pdf_url":"https://arxiv.org/pdf/2401.12001v1.pdf","comment":"AAAI 2024. The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2401.11985v1","updated":"2024-01-22T14:38:25Z","published":"2024-01-22T14:38:25Z","title":"Scaling Face Interaction Graph Networks to Real World Scenes","summary":" Accurately simulating real world object dynamics is essential for various\napplications such as robotics, engineering, graphics, and design. To better\ncapture complex real dynamics such as contact and friction, learned simulators\nbased on graph networks have recently shown great promise. However, applying\nthese learned simulators to real scenes comes with two major challenges: first,\nscaling learned simulators to handle the complexity of real world scenes which\ncan involve hundreds of objects each with complicated 3D shapes, and second,\nhandling inputs from perception rather than 3D state information. Here we\nintroduce a method which substantially reduces the memory required to run\ngraph-based learned simulators. 
Based on this memory-efficient simulation\nmodel, we then present a perceptual interface in the form of editable NeRFs\nwhich can convert real-world scenes into a structured representation that can\nbe processed by graph network simulator. We show that our method uses\nsubstantially less memory than previous graph-based simulators while retaining\ntheir accuracy, and that the simulators learned in synthetic environments can\nbe applied to real world scenes captured from multiple camera angles. This\npaves the way for expanding the application of learned simulators to settings\nwhere only perceptual information is available at inference time.\n","authors":["Tatiana Lopez-Guevara","Yulia Rubanova","William F. Whitney","Tobias Pfaff","Kimberly Stachenfeld","Kelsey R. Allen"],"pdf_url":"https://arxiv.org/pdf/2401.11985v1.pdf","comment":"16 pages, 12 figures"},{"id":"http://arxiv.org/abs/2401.11960v1","updated":"2024-01-22T14:02:56Z","published":"2024-01-22T14:02:56Z","title":"Observation-Guided Meteorological Field Downscaling at Station Scale: A\n Benchmark and a New Method","summary":" Downscaling (DS) of meteorological variables involves obtaining\nhigh-resolution states from low-resolution meteorological fields and is an\nimportant task in weather forecasting. Previous methods based on deep learning\ntreat downscaling as a super-resolution task in computer vision and utilize\nhigh-resolution gridded meteorological fields as supervision to improve\nresolution at specific grid scales. However, this approach has struggled to\nalign with the continuous distribution characteristics of meteorological\nfields, leading to an inherent systematic bias between the downscaled results\nand the actual observations at meteorological stations. In this paper, we\nextend meteorological downscaling to arbitrary scattered station scales,\nestablish a brand new benchmark and dataset, and retrieve meteorological states\nat any given station location from a coarse-resolution meteorological field.\nInspired by data assimilation techniques, we integrate observational data into\nthe downscaling process, providing multi-scale observational priors. Building\non this foundation, we propose a new downscaling model based on hypernetwork\narchitecture, namely HyperDS, which efficiently integrates different\nobservational information into the model training, achieving continuous scale\nmodeling of the meteorological field. Through extensive experiments, our\nproposed method outperforms other specially designed baseline models on\nmultiple surface variables. Notably, the mean squared error (MSE) for wind\nspeed and surface pressure improved by 67% and 19.5% compared to other methods.\nWe will release the dataset and code subsequently.\n","authors":["Zili Liu","Hao Chen","Lei Bai","Wenyuan Li","Keyan Chen","Zhengyi Wang","Wanli Ouyang","Zhengxia Zou","Zhenwei Shi"],"pdf_url":"https://arxiv.org/pdf/2401.11960v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11949v1","updated":"2024-01-22T13:38:24Z","published":"2024-01-22T13:38:24Z","title":"Feature Denoising Diffusion Model for Blind Image Quality Assessment","summary":" Blind Image Quality Assessment (BIQA) aims to evaluate image quality in line\nwith human perception, without reference benchmarks. Currently, deep learning\nBIQA methods typically depend on using features from high-level tasks for\ntransfer learning. However, the inherent differences between BIQA and these\nhigh-level tasks inevitably introduce noise into the quality-aware features. 
In\nthis paper, we take an initial step towards exploring the diffusion model for\nfeature denoising in BIQA, namely Perceptual Feature Diffusion for IQA\n(PFD-IQA), which aims to remove noise from quality-aware features.\nSpecifically, (i) We propose a Perceptual Prior Discovery and Aggregation\nmodule to establish two auxiliary tasks to discover potential low-level\nfeatures in images that are used to aggregate perceptual text conditions for\nthe diffusion model. (ii) We propose a Perceptual Prior-based Feature\nRefinement strategy, which matches noisy features to predefined denoising\ntrajectories and then performs exact feature denoising based on text\nconditions. Extensive experiments on eight standard BIQA datasets demonstrate\nthe superior performance to the state-of-the-art BIQA methods, i.e., achieving\nthe PLCC values of 0.935 (vs. 0.905 in KADID) and 0.922 (vs. 0.894 in LIVEC).\n","authors":["Xudong Li","Jingyuan Zheng","Runze Hu","Yan Zhang","Ke Li","Yunhang Shen","Xiawu Zheng","Yutao Liu","ShengChuan Zhang","Pingyang Dai","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2401.11949v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11944v1","updated":"2024-01-22T13:34:34Z","published":"2024-01-22T13:34:34Z","title":"CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding\n Benchmark","summary":" As the capabilities of large multimodal models (LMMs) continue to advance,\nevaluating the performance of LMMs emerges as an increasing need. Additionally,\nthere is an even larger gap in evaluating the advanced knowledge and reasoning\nabilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU,\na new Chinese Massive Multi-discipline Multimodal Understanding benchmark\ndesigned to evaluate LMMs on tasks demanding college-level subject knowledge\nand deliberate reasoning in a Chinese context. CMMMU is inspired by and\nstrictly follows the annotation and analysis pattern of MMMU.\n CMMMU includes 12k manually collected multimodal questions from college\nexams, quizzes, and textbooks, covering six core disciplines: Art & Design,\nBusiness, Science, Health & Medicine, Humanities & Social Science, and Tech &\nEngineering, like its companion, MMMU. These questions span 30 subjects and\ncomprise 39 highly heterogeneous image types, such as charts, diagrams, maps,\ntables, music sheets, and chemical structures.\n CMMMU focuses on complex perception and reasoning with domain-specific\nknowledge in the Chinese context. We evaluate 11 open-source LLMs and one\nproprietary GPT-4V(ision). Even GPT-4V only achieves accuracies of 42%,\nindicating a large space for improvement.
CMMMU will boost the community to\nbuild the next-generation LMMs towards expert artificial intelligence and\npromote the democratization of LMMs by providing diverse language contexts.\n","authors":["Ge Zhang","Xinrun Du","Bei Chen","Yiming Liang","Tongxu Luo","Tianyu Zheng","Kang Zhu","Yuyang Cheng","Chunpu Xu","Shuyue Guo","Haoran Zhang","Xingwei Qu","Junjie Wang","Ruibin Yuan","Yizhi Li","Zekun Wang","Yudong Liu","Yu-Hsuan Tsai","Fengji Zhang","Chenghua Lin","Wenhao Huang","Wenhu Chen","Jie Fu"],"pdf_url":"https://arxiv.org/pdf/2401.11944v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11943v1","updated":"2024-01-22T13:33:53Z","published":"2024-01-22T13:33:53Z","title":"Benchmarking Large Multimodal Models against Common Corruptions","summary":" This technical report aims to fill a deficiency in the assessment of large\nmultimodal models (LMMs) by specifically examining the self-consistency of\ntheir outputs when subjected to common corruptions. We investigate the\ncross-modal interactions between text, image, and speech, encompassing four\nessential generation tasks: text-to-image, image-to-text, text-to-speech, and\nspeech-to-text. We create a comprehensive benchmark, named MMCBench, that\ncovers more than 100 popular LMMs (totally over 150 model checkpoints). A\nthorough evaluation under common corruptions is critical for practical\ndeployment and facilitates a better understanding of the reliability of\ncutting-edge LMMs. The benchmarking code is available at\nhttps://github.com/sail-sg/MMCBench\n","authors":["Jiawei Zhang","Tianyu Pang","Chao Du","Yi Ren","Bo Li","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2401.11943v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2303.07064v3","updated":"2024-01-22T13:26:32Z","published":"2023-03-13T12:38:07Z","title":"A Generalized Multi-Modal Fusion Detection Framework","summary":" LiDAR point clouds have become the most common data source in autonomous\ndriving. However, due to the sparsity of point clouds, accurate and reliable\ndetection cannot be achieved in specific scenarios. Because of their\ncomplementarity with point clouds, images are getting increasing attention.\nAlthough with some success, existing fusion methods either perform hard fusion\nor do not fuse in a direct manner. In this paper, we propose a generic 3D\ndetection framework called MMFusion, using multi-modal features. The framework\naims to achieve accurate fusion between LiDAR and images to improve 3D\ndetection in complex scenes. Our framework consists of two separate streams:\nthe LiDAR stream and the camera stream, which can be compatible with any\nsingle-modal feature extraction network. The Voxel Local Perception Module in\nthe LiDAR stream enhances local feature representation, and then the\nMulti-modal Feature Fusion Module selectively combines feature output from\ndifferent streams to achieve better fusion. Extensive experiments have shown\nthat our framework not only outperforms existing benchmarks but also improves\ntheir detection, especially for detecting cyclists and pedestrians on KITTI\nbenchmarks, with strong robustness and generalization capabilities. 
Hopefully,\nour work will stimulate more research into multi-modal fusion for autonomous\ndriving tasks.\n","authors":["Leichao Cui","Xiuxian Li","Min Meng","Xiaoyu Mo"],"pdf_url":"https://arxiv.org/pdf/2303.07064v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.15567v3","updated":"2024-01-22T13:17:21Z","published":"2023-07-28T14:04:06Z","title":"Panoptic Scene Graph Generation with Semantics-Prototype Learning","summary":" Panoptic Scene Graph Generation (PSG) parses objects and predicts their\nrelationships (predicate) to connect human language and visual scenes. However,\ndifferent language preferences of annotators and semantic overlaps between\npredicates lead to biased predicate annotations in the dataset, i.e. different\npredicates for same object pairs. Biased predicate annotations make PSG models\nstruggle in constructing a clear decision plane among predicates, which greatly\nhinders the real application of PSG models. To address the intrinsic bias\nabove, we propose a novel framework named ADTrans to adaptively transfer biased\npredicate annotations to informative and unified ones. To promise consistency\nand accuracy during the transfer process, we propose to measure the invariance\nof representations in each predicate class, and learn unbiased prototypes of\npredicates with different intensities. Meanwhile, we continuously measure the\ndistribution changes between each presentation and its prototype, and\nconstantly screen potential biased data. Finally, with the unbiased\npredicate-prototype representation embedding space, biased annotations are\neasily identified. Experiments show that ADTrans significantly improves the\nperformance of benchmark models, achieving a new state-of-the-art performance,\nand shows great generalization and effectiveness on multiple datasets.\n","authors":["Li Li","Wei Ji","Yiming Wu","Mengze Li","You Qin","Lina Wei","Roger Zimmermann"],"pdf_url":"https://arxiv.org/pdf/2307.15567v3.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2310.09126v2","updated":"2024-01-22T13:14:33Z","published":"2023-10-13T14:14:43Z","title":"Physics-guided Noise Neural Proxy for Practical Low-light Raw Image\n Denoising","summary":" Recently, the mainstream practice for training low-light raw image denoising\nmethods has shifted towards employing synthetic data. Noise modeling, which\nfocuses on characterizing the noise distribution of real-world sensors,\nprofoundly influences the effectiveness and practicality of synthetic data.\nCurrently, physics-based noise modeling struggles to characterize the entire\nreal noise distribution, while learning-based noise modeling impractically\ndepends on paired real data. In this paper, we propose a novel strategy:\nlearning the noise model from dark frames instead of paired real data, to break\ndown the data dependency. Based on this strategy, we introduce an efficient\nphysics-guided noise neural proxy (PNNP) to approximate the real-world sensor\nnoise model. Specifically, we integrate physical priors into neural proxies and\nintroduce three efficient techniques: physics-guided noise decoupling (PND),\nphysics-guided proxy model (PPM), and differentiable distribution loss (DDL).\nPND decouples the dark frame into different components and handles different\nlevels of noise flexibly, which reduces the complexity of noise modeling. PPM\nincorporates physical priors to constrain the generated noise, which promotes\nthe accuracy of noise modeling. 
DDL provides explicit and reliable supervision\nfor noise distribution, which promotes the precision of noise modeling. PNNP\nexhibits powerful potential in characterizing the real noise distribution.\nExtensive experiments on public datasets demonstrate superior performance in\npractical low-light raw image denoising. The code will be available at\n\\url{https://github.com/fenghansen/PNNP}.\n","authors":["Hansen Feng","Lizhi Wang","Yiqi Huang","Yuzhi Wang","Lin Zhu","Hua Huang"],"pdf_url":"https://arxiv.org/pdf/2310.09126v2.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2401.11914v1","updated":"2024-01-22T13:01:35Z","published":"2024-01-22T13:01:35Z","title":"A Saliency Enhanced Feature Fusion based multiscale RGB-D Salient Object\n Detection Network","summary":" Multiscale convolutional neural network (CNN) has demonstrated remarkable\ncapabilities in solving various vision problems. However, fusing features of\ndifferent scales always results in large model sizes, impeding the application\nof multiscale CNNs in RGB-D saliency detection. In this paper, we propose a\ncustomized feature fusion module, called Saliency Enhanced Feature Fusion\n(SEFF), for RGB-D saliency detection. SEFF utilizes saliency maps of the\nneighboring scales to enhance the necessary features for fusing, resulting in\nmore representative fused features. Our multiscale RGB-D saliency detector uses\nSEFF and processes images with three different scales. SEFF is used to fuse the\nfeatures of RGB and depth images, as well as the features of decoders at\ndifferent scales. Extensive experiments on five benchmark datasets have\ndemonstrated the superiority of our method over ten SOTA saliency detectors.\n","authors":["Rui Huang","Qingyi Zhao","Yan Xing","Sihua Gao","Weifeng Xu","Yuxiang Zhang","Wei Fan"],"pdf_url":"https://arxiv.org/pdf/2401.11914v1.pdf","comment":"Accepted by 2024 IEEE International Conference on Acoustics, Speech,\n and Signal Processing (ICASSP 2024)"},{"id":"http://arxiv.org/abs/2401.11913v1","updated":"2024-01-22T13:01:28Z","published":"2024-01-22T13:01:28Z","title":"Large receptive field strategy and important feature extraction strategy\n in 3D object detection","summary":" The enhancement of 3D object detection is pivotal for precise environmental\nperception and improved task execution capabilities in autonomous driving.\nLiDAR point clouds, offering accurate depth information, serve as crucial\ninformation for this purpose. Our study focuses on key challenges in 3D target\ndetection. To tackle the challenge of expanding the receptive field of a 3D\nconvolutional kernel, we introduce the Dynamic Feature Fusion Module (DFFM).\nThis module achieves adaptive expansion of the 3D convolutional kernel's\nreceptive field, balancing the expansion with acceptable computational loads.\nThis innovation reduces operations, expands the receptive field, and allows the\nmodel to dynamically adjust to different object requirements. Simultaneously,\nwe identify redundant information in 3D features. Employing the Feature\nSelection Module (FSM) quantitatively evaluates and eliminates non-important\nfeatures, achieving the separation of output box fitting and feature\nextraction. This innovation enables the detector to focus on critical features,\nresulting in model compression, reduced computational burden, and minimized\ncandidate frame interference.
Extensive experiments confirm that both DFFM and\nFSM not only enhance current benchmarks, particularly in small target\ndetection, but also accelerate network performance. Importantly, these modules\nexhibit effective complementarity.\n","authors":["Leichao Cui","Xiuxian Li","Min Meng"],"pdf_url":"https://arxiv.org/pdf/2401.11913v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11902v1","updated":"2024-01-22T12:50:21Z","published":"2024-01-22T12:50:21Z","title":"A Training-Free Defense Framework for Robust Learned Image Compression","summary":" We study the robustness of learned image compression models against\nadversarial attacks and present a training-free defense technique based on\nsimple image transform functions. Recent learned image compression models are\nvulnerable to adversarial attacks that result in poor compression rate, low\nreconstruction quality, or weird artifacts. To address the limitations, we\npropose a simple but effective two-way compression algorithm with random input\ntransforms, which is conveniently applicable to existing image compression\nmodels. Unlike the na\\\"ive approaches, our approach preserves the original\nrate-distortion performance of the models on clean images. Moreover, the\nproposed algorithm requires no additional training or modification of existing\nmodels, making it more practical. We demonstrate the effectiveness of the\nproposed techniques through extensive experiments under multiple compression\nmodels, evaluation metrics, and attack scenarios.\n","authors":["Myungseo Song","Jinyoung Choi","Bohyung Han"],"pdf_url":"https://arxiv.org/pdf/2401.11902v1.pdf","comment":"10 pages and 14 figures"},{"id":"http://arxiv.org/abs/2203.13718v2","updated":"2024-01-22T12:47:52Z","published":"2022-03-25T15:40:44Z","title":"Digital Fingerprinting of Microstructures","summary":" Finding efficient means of fingerprinting microstructural information is a\ncritical step towards harnessing data-centric machine learning approaches. A\nstatistical framework is systematically developed for compressed\ncharacterisation of a population of images, which includes some classical\ncomputer vision methods as special cases. The focus is on materials\nmicrostructure. The ultimate purpose is to rapidly fingerprint sample images in\nthe context of various high-throughput design/make/test scenarios. This\nincludes, but is not limited to, quantification of the disparity between\nmicrostructures for quality control, classifying microstructures, predicting\nmaterials properties from image data and identifying potential processing\nroutes to engineer new materials with specific properties. Here, we consider\nmicrostructure classification and utilise the resulting features over a range\nof related machine learning tasks, namely supervised, semi-supervised, and\nunsupervised learning.\n The approach is applied to two distinct datasets to illustrate various\naspects and some recommendations are made based on the findings. In particular,\nmethods that leverage transfer learning with convolutional neural networks\n(CNNs), pretrained on the ImageNet dataset, are generally shown to outperform\nother methods. Additionally, dimensionality reduction of these CNN-based\nfingerprints is shown to have negligible impact on classification accuracy for\nthe supervised learning approaches considered. 
In situations where there is a\nlarge dataset with only a handful of images labelled, graph-based label\npropagation to unlabelled data is shown to be favourable over discarding\nunlabelled data and performing supervised learning. In particular, label\npropagation by Poisson learning is shown to be highly effective at low label\nrates.\n","authors":["Michael D. White","Alexander Tarakanov","Christopher P. Race","Philip J. Withers","Kody J. H. Law"],"pdf_url":"https://arxiv.org/pdf/2203.13718v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.00067v3","updated":"2024-01-22T12:24:36Z","published":"2022-06-30T19:13:23Z","title":"Rethinking Unsupervised Domain Adaptation for Semantic Segmentation","summary":" Unsupervised domain adaptation (UDA) adapts a model trained on one domain\n(called source) to a novel domain (called target) using only unlabeled data.\nDue to its high annotation cost, researchers have developed many UDA methods\nfor semantic segmentation, which assume no labeled sample is available in the\ntarget domain. We question the practicality of this assumption for two reasons.\nFirst, after training a model with a UDA method, we must somehow verify the\nmodel before deployment. Second, UDA methods have at least a few\nhyper-parameters that need to be determined. The surest solution to these is to\nevaluate the model using validation data, i.e., a certain amount of labeled\ntarget-domain samples. This question about the basic assumption of UDA leads us\nto rethink UDA from a data-centric point of view. Specifically, we assume we\nhave access to a minimum level of labeled data. Then, we ask how much is\nnecessary to find good hyper-parameters of existing UDA methods. We then\nconsider what if we use the same data for supervised training of the same\nmodel, e.g., finetuning. We conducted experiments to answer these questions\nwith popular scenarios, {GTA5, SYNTHIA}$\\rightarrow$Cityscapes. We found that\ni) choosing good hyper-parameters needs only a few labeled images for some UDA\nmethods whereas a lot more for others; and ii) simple finetuning works\nsurprisingly well; it outperforms many UDA methods if only several dozens of\nlabeled images are available.\n","authors":["Zhijie Wang","Masanori Suganuma","Takayuki Okatani"],"pdf_url":"https://arxiv.org/pdf/2207.00067v3.pdf","comment":"Under review in Pattern Recognition Letters"},{"id":"http://arxiv.org/abs/2401.11877v1","updated":"2024-01-22T12:02:40Z","published":"2024-01-22T12:02:40Z","title":"Evaluating the Feasibility of Standard Facial Expression Recognition in\n Individuals with Moderate to Severe Intellectual Disabilities","summary":" Recent research has underscored the increasing preference of users for\nhuman-like interactions with machines. Consequently, facial expression\nrecognition has gained significance as a means of imparting social robots with\nthe capacity to discern the emotional states of users. In this investigation,\nwe assess the suitability of deep learning approaches, known for their\nremarkable performance in this domain, for recognizing facial expressions in\nindividuals with intellectual disabilities, which has not been yet studied in\nthe literature, to the best of our knowledge. To address this objective, we\ntrain a set of twelve distinct convolutional neural networks in different\napproaches, including an ensemble of datasets without individuals with\nintellectual disabilities and a dataset featuring such individuals. 
Our\nexamination of the outcomes achieved by the various models under distinct\ntraining conditions, coupled with a comprehensive analysis of critical facial\nregions during expression recognition facilitated by explainable artificial\nintelligence techniques, revealed significant distinctions in facial\nexpressions between individuals with and without intellectual disabilities, as\nwell as among individuals with intellectual disabilities. Remarkably, our\nfindings demonstrate the feasibility of facial expression recognition within\nthis population through tailored user-specific training methodologies, which\nenable the models to effectively address the unique expressions of each user.\n","authors":["F. Xavier Gaya-Morey","Silvia Ramis","Jose M. Buades-Rubio","Cristina Manresa-Yee"],"pdf_url":"https://arxiv.org/pdf/2401.11877v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11874v1","updated":"2024-01-22T12:00:37Z","published":"2024-01-22T12:00:37Z","title":"Detect-Order-Construct: A Tree Construction based Approach for\n Hierarchical Document Structure Analysis","summary":" Document structure analysis (aka document layout analysis) is crucial for\nunderstanding the physical layout and logical structure of documents, with\napplications in information retrieval, document summarization, knowledge\nextraction, etc. In this paper, we concentrate on Hierarchical Document\nStructure Analysis (HDSA) to explore hierarchical relationships within\nstructured documents created using authoring software employing hierarchical\nschemas, such as LaTeX, Microsoft Word, and HTML. To comprehensively analyze\nhierarchical document structures, we propose a tree construction based approach\nthat addresses multiple subtasks concurrently, including page object detection\n(Detect), reading order prediction of identified objects (Order), and the\nconstruction of intended hierarchical structure (Construct). We present an\neffective end-to-end solution based on this framework to demonstrate its\nperformance. To assess our approach, we develop a comprehensive benchmark\ncalled Comp-HRDoc, which evaluates the above subtasks simultaneously. Our\nend-to-end system achieves state-of-the-art performance on two large-scale\ndocument layout analysis datasets (PubLayNet and DocLayNet), a high-quality\nhierarchical document structure reconstruction dataset (HRDoc), and our\nComp-HRDoc benchmark. The Comp-HRDoc benchmark will be released to facilitate\nfurther research in this field.\n","authors":["Jiawei Wang","Kai Hu","Zhuoyao Zhong","Lei Sun","Qiang Huo"],"pdf_url":"https://arxiv.org/pdf/2401.11874v1.pdf","comment":"Submitted to Pattern Recognition"},{"id":"http://arxiv.org/abs/2401.11859v1","updated":"2024-01-22T11:28:24Z","published":"2024-01-22T11:28:24Z","title":"LKFormer: Large Kernel Transformer for Infrared Image Super-Resolution","summary":" Given the broad application of infrared technology across diverse fields,\nthere is an increasing emphasis on investigating super-resolution techniques\nfor infrared images within the realm of deep learning. Despite the impressive\nresults of current Transformer-based methods in image super-resolution tasks,\ntheir reliance on the self-attentive mechanism intrinsic to the Transformer\narchitecture results in images being treated as one-dimensional sequences,\nthereby neglecting their inherent two-dimensional structure. 
Moreover, infrared\nimages exhibit a uniform pixel distribution and a limited gradient range,\nposing challenges for the model to capture effective feature information.\nConsequently, we suggest a potent Transformer model, termed Large Kernel\nTransformer (LKFormer), to address this issue. Specifically, we have designed a\nLarge Kernel Residual Depth-wise Convolutional Attention (LKRDA) module with\nlinear complexity. This mainly employs depth-wise convolution with large\nkernels to execute non-local feature modeling, thereby substituting the\nstandard self-attentive layer. Additionally, we have devised a novel\nfeed-forward network structure called Gated-Pixel Feed-Forward Network (GPFN)\nto augment the LKFormer's capacity to manage the information flow within the\nnetwork. Comprehensive experimental results reveal that our method surpasses\nthe most advanced techniques available, using fewer parameters and yielding\nconsiderably superior performance.\n","authors":["Feiwei Qin","Kang Yan","Changmiao Wang","Ruiquan Ge","Yong Peng","Kai Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.11859v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11856v1","updated":"2024-01-22T11:25:59Z","published":"2024-01-22T11:25:59Z","title":"MOSformer: Momentum encoder-based inter-slice fusion transformer for\n medical image segmentation","summary":" Medical image segmentation takes an important position in various clinical\napplications. Deep learning has emerged as the predominant solution for\nautomated segmentation of volumetric medical images. 2.5D-based segmentation\nmodels bridge computational efficiency of 2D-based models and spatial\nperception capabilities of 3D-based models. However, prevailing 2.5D-based\nmodels often treat each slice equally, failing to effectively learn and exploit\ninter-slice information, resulting in suboptimal segmentation performances. In\nthis paper, a novel Momentum encoder-based inter-slice fusion transformer\n(MOSformer) is proposed to overcome this issue by leveraging inter-slice\ninformation at multi-scale feature maps extracted by different encoders.\nSpecifically, dual encoders are employed to enhance feature distinguishability\namong different slices. One of the encoders is moving-averaged to maintain the\nconsistency of slice representations. Moreover, an IF-Swin transformer module\nis developed to fuse inter-slice multi-scale features. The MOSformer is\nevaluated on three benchmark datasets (Synapse, ACDC, and AMOS), establishing a\nnew state-of-the-art with 85.63%, 92.19%, and 85.43% of DSC, respectively.\nThese promising results indicate its competitiveness in medical image\nsegmentation. Codes and models of MOSformer will be made publicly available\nupon acceptance.\n","authors":["De-Xing Huang","Xiao-Hu Zhou","Xiao-Liang Xie","Shi-Qi Liu","Zhen-Qiu Feng","Mei-Jiang Gui","Hao Li","Tian-Yu Xiang","Xiu-Ling Liu","Zeng-Guang Hou"],"pdf_url":"https://arxiv.org/pdf/2401.11856v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2401.11847v1","updated":"2024-01-22T11:04:55Z","published":"2024-01-22T11:04:55Z","title":"SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by\n Visual-Textual Contrastive Learning","summary":" Sign language recognition (SLR) plays a vital role in facilitating\ncommunication for the hearing-impaired community. SLR is a weakly supervised\ntask where entire videos are annotated with glosses, making it challenging to\nidentify the corresponding gloss within a video segment. 
Recent studies\nindicate that the main bottleneck in SLR is the insufficient training caused by\nthe limited availability of large-scale datasets. To address this challenge, we\npresent SignVTCL, a multi-modal continuous sign language recognition framework\nenhanced by visual-textual contrastive learning, which leverages the full\npotential of multi-modal data and the generalization ability of language model.\nSignVTCL integrates multi-modal data (video, keypoints, and optical flow)\nsimultaneously to train a unified visual backbone, thereby yielding more robust\nvisual representations. Furthermore, SignVTCL contains a visual-textual\nalignment approach incorporating gloss-level and sentence-level alignment to\nensure precise correspondence between visual features and glosses at the level\nof individual glosses and sentence. Experimental results conducted on three\ndatasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, demonstrate that SignVTCL\nachieves state-of-the-art results compared with previous methods.\n","authors":["Hao Chen","Jiaze Wang","Ziyu Guo","Jinpeng Li","Donghao Zhou","Bian Wu","Chenyong Guan","Guangyong Chen","Pheng-Ann Heng"],"pdf_url":"https://arxiv.org/pdf/2401.11847v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11844v1","updated":"2024-01-22T11:01:52Z","published":"2024-01-22T11:01:52Z","title":"Adaptive Fusion of Multi-view Remote Sensing data for Optimal Sub-field\n Crop Yield Prediction","summary":" Accurate crop yield prediction is of utmost importance for informed\ndecision-making in agriculture, aiding farmers, and industry stakeholders.\nHowever, this task is complex and depends on multiple factors, such as\nenvironmental conditions, soil properties, and management practices. Combining\nheterogeneous data views poses a fusion challenge, like identifying the\nview-specific contribution to the predictive task. We present a novel\nmulti-view learning approach to predict crop yield for different crops\n(soybean, wheat, rapeseed) and regions (Argentina, Uruguay, and Germany). Our\nmulti-view input data includes multi-spectral optical images from Sentinel-2\nsatellites and weather data as dynamic features during the crop growing season,\ncomplemented by static features like soil properties and topographic\ninformation. To effectively fuse the data, we introduce a Multi-view Gated\nFusion (MVGF) model, comprising dedicated view-encoders and a Gated Unit (GU)\nmodule. The view-encoders handle the heterogeneity of data sources with varying\ntemporal resolutions by learning a view-specific representation. These\nrepresentations are adaptively fused via a weighted sum. The fusion weights are\ncomputed for each sample by the GU using a concatenation of the\nview-representations. The MVGF model is trained at sub-field level with 10 m\nresolution pixels. Our evaluations show that the MVGF outperforms conventional\nmodels on the same task, achieving the best results by incorporating all the\ndata sources, unlike the usual fusion results in the literature. For Argentina,\nthe MVGF model achieves an R2 value of 0.68 at sub-field yield prediction,\nwhile at field level evaluation (comparing field averages), it reaches around\n0.80 across different countries. 
The GU module learned different weights based\non the country and crop-type, aligning with the variable significance of each\ndata source to the prediction task.\n","authors":["Francisco Mena","Deepak Pathak","Hiba Najjar","Cristhian Sanchez","Patrick Helber","Benjamin Bischke","Peter Habelitz","Miro Miranda","Jayanth Siddamsetty","Marlon Nuske","Marcela Charfuelan","Diego Arenas","Michaela Vollmer","Andreas Dengel"],"pdf_url":"https://arxiv.org/pdf/2401.11844v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11835v1","updated":"2024-01-22T10:52:02Z","published":"2024-01-22T10:52:02Z","title":"Unveiling the Human-like Similarities of Automatic Facial Expression\n Recognition: An Empirical Exploration through Explainable AI","summary":" Facial expression recognition is vital for human behavior analysis, and deep\nlearning has enabled models that can outperform humans. However, it is unclear\nhow closely they mimic human processing. This study aims to explore the\nsimilarity between deep neural networks and human perception by comparing\ntwelve different networks, including both general object classifiers and\nFER-specific models. We employ an innovative global explainable AI method to\ngenerate heatmaps, revealing crucial facial regions for the twelve networks\ntrained on six facial expressions. We assess these results both quantitatively\nand qualitatively, comparing them to ground truth masks based on Friesen and\nEkman's description and among them. We use Intersection over Union (IoU) and\nnormalized correlation coefficients for comparisons. We generate 72 heatmaps to\nhighlight critical regions for each expression and architecture. Qualitatively,\nmodels with pre-trained weights show more similarity in heatmaps compared to\nthose without pre-training. Specifically, eye and nose areas influence certain\nfacial expressions, while the mouth is consistently important across all models\nand expressions. Quantitatively, we find low average IoU values (avg. 0.2702)\nacross all expressions and architectures. The best-performing architecture\naverages 0.3269, while the worst-performing one averages 0.2066. Dendrograms,\nbuilt with the normalized correlation coefficient, reveal two main clusters for\nmost expressions: models with pre-training and models without pre-training.\nFindings suggest limited alignment between human and AI facial expression\nrecognition, with network architectures influencing the similarity, as similar\narchitectures prioritize similar facial regions.\n","authors":["F. Xavier Gaya-Morey","Silvia Ramis-Guarinos","Cristina Manresa-Yee","Jose M. Buades-Rubio"],"pdf_url":"https://arxiv.org/pdf/2401.11835v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11831v1","updated":"2024-01-22T10:42:51Z","published":"2024-01-22T10:42:51Z","title":"A Fair Evaluation of Various Deep Learning-Based Document Image\n Binarization Approaches","summary":" Binarization of document images is an important pre-processing step in the\nfield of document analysis. Traditional image binarization techniques usually\nrely on histograms or local statistics to identify a valid threshold to\ndifferentiate between different aspects of the image. Deep learning techniques\nare able to generate binarized versions of the images by learning\ncontext-dependent features that are less error-prone to degradation typically\noccurring in document images. In recent years, many deep learning-based methods\nhave been developed for document binarization. But which one to choose? 
There\nhave been no studies that compare these methods rigorously. Therefore, this\nwork focuses on the evaluation of different deep learning-based methods under\nthe same evaluation protocol. We evaluate them on different Document Image\nBinarization Contest (DIBCO) datasets and obtain very heterogeneous results. We\nshow that the DE-GAN model was able to perform better compared to other models\nwhen evaluated on the DIBCO2013 dataset while DP-LinkNet performed best on the\nDIBCO2017 dataset. The 2-StageGAN performed best on the DIBCO2018 dataset while\nSauvolaNet outperformed the others on the DIBCO2019 challenge. Finally, we make\nthe code, all models and evaluation publicly available\n(https://github.com/RichSu95/Document_Binarization_Collection) to ensure\nreproducibility and simplify future binarization evaluations.\n","authors":["Richin Sukesh","Mathias Seuret","Anguelos Nicolaou","Martin Mayr","Vincent Christlein"],"pdf_url":"https://arxiv.org/pdf/2401.11831v1.pdf","comment":"DAS 2022"},{"id":"http://arxiv.org/abs/2401.11824v1","updated":"2024-01-22T10:37:59Z","published":"2024-01-22T10:37:59Z","title":"Rethinking Centered Kernel Alignment in Knowledge Distillation","summary":" Knowledge distillation has emerged as a highly effective method for bridging\nthe representation discrepancy between large-scale models and lightweight\nmodels. Prevalent approaches involve leveraging appropriate metrics to minimize\nthe divergence or distance between the knowledge extracted from the teacher\nmodel and the knowledge learned by the student model. Centered Kernel Alignment\n(CKA) is widely used to measure representation similarity and has been applied\nin several knowledge distillation methods. However, these methods are complex\nand fail to uncover the essence of CKA, thus not answering the question of how\nto use CKA to achieve simple and effective distillation properly. This paper\nfirst provides a theoretical perspective to illustrate the effectiveness of\nCKA, which decouples CKA to the upper bound of Maximum Mean Discrepancy~(MMD)\nand a constant term. Drawing from this, we propose a novel Relation-Centered\nKernel Alignment~(RCKA) framework, which practically establishes a connection\nbetween CKA and MMD. Furthermore, we dynamically customize the application of\nCKA based on the characteristics of each task, with less computational source\nyet comparable performance than the previous methods. The extensive experiments\non the CIFAR-100, ImageNet-1k, and MS-COCO demonstrate that our method achieves\nstate-of-the-art performance on almost all teacher-student pairs for image\nclassification and object detection, validating the effectiveness of our\napproaches.\n","authors":["Zikai Zhou","Yunhang Shen","Shitong Shao","Huanran Chen","Linrui Gong","Shaohui Lin"],"pdf_url":"https://arxiv.org/pdf/2401.11824v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11814v1","updated":"2024-01-22T10:22:14Z","published":"2024-01-22T10:22:14Z","title":"Symbrain: A large-scale dataset of MRI images for neonatal brain\n symmetry analysis","summary":" This paper presents an annotated dataset of brain MRI images designed to\nadvance the field of brain symmetry study. 
Magnetic resonance imaging (MRI) has\ngained interest in analyzing brain symmetry in neonatal infants, and challenges\nremain due to the vast size differences between fetal and adult brains.\nClassification methods for brain structural MRI use scales and visual cues to\nassess hemisphere symmetry, which can help diagnose neonatal patients by\ncomparing hemispheres and anatomical regions of interest in the brain. Using\nthe Developing Human Connectome Project dataset, this work presents a dataset\ncomprising cerebral images extracted as slices across selected portions of\ninterest for clinical evaluation. All the extracted images are annotated with\nthe brain's midline. From the assumption that a decrease in symmetry is directly related to\npossible clinical pathologies, the dataset can contribute to a more precise\ndiagnosis because it can be used to train deep learning models for\nneonatal cerebral MRI anomaly detection from postnatal infant scans using\ncomputer vision. Such models learn to identify and classify anomalies by\nidentifying potential asymmetrical patterns in medical MRI images. Furthermore,\nthis dataset can contribute to the research and development of methods using\nthe relative symmetry of the two brain hemispheres for crucial diagnosis and\ntreatment planning.\n","authors":["Arnaud Gucciardi","Safouane El Ghazouali","Francesca Venturini","Vida Groznik","Umberto Michelucci"],"pdf_url":"https://arxiv.org/pdf/2401.11814v1.pdf","comment":"7 pages, 2 figures, Dataset Paper, Medical AI"},{"id":"http://arxiv.org/abs/2401.02436v2","updated":"2024-01-22T10:08:28Z","published":"2023-11-17T14:40:43Z","title":"Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis","summary":" Recently, high-fidelity scene reconstruction with an optimized 3D Gaussian\nsplat representation has been introduced for novel view synthesis from sparse\nimage sets. Making such representations suitable for applications like network\nstreaming and rendering on low-power devices requires significantly reduced\nmemory consumption as well as improved rendering efficiency. We propose a\ncompressed 3D Gaussian splat representation that utilizes sensitivity-aware\nvector clustering with quantization-aware training to compress directional\ncolors and Gaussian parameters. The learned codebooks have low bitrates and\nachieve a compression rate of up to $31\\times$ on real-world scenes with only\nminimal degradation of visual quality. We demonstrate that the compressed splat\nrepresentation can be efficiently rendered with hardware rasterization on\nlightweight GPUs at up to $4\\times$ higher framerates than reported via an\noptimized GPU compute pipeline. Extensive experiments across multiple datasets\ndemonstrate the robustness and rendering speed of the proposed approach.\n","authors":["Simon Niedermayr","Josef Stumpfegger","Rüdiger Westermann"],"pdf_url":"https://arxiv.org/pdf/2401.02436v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11796v1","updated":"2024-01-22T09:53:20Z","published":"2024-01-22T09:53:20Z","title":"Local Agnostic Video Explanations: a Study on the Applicability of\n Removal-Based Explanations to Video","summary":" Explainable artificial intelligence techniques are becoming increasingly\nimportant with the rise of deep learning applications in various domains.
These\ntechniques aim to provide a better understanding of complex \"black box\" models\nand enhance user trust while maintaining high learning performance. While many\nstudies have focused on explaining deep learning models in computer vision for\nimage input, video explanations remain relatively unexplored due to the\ntemporal dimension's complexity. In this paper, we present a unified framework\nfor local agnostic explanations in the video domain. Our contributions include:\n(1) Extending a fine-grained explanation framework tailored for computer vision\ndata, (2) Adapting six existing explanation techniques to work on video data by\nincorporating temporal information and enabling local explanations, and (3)\nConducting an evaluation and comparison of the adapted explanation methods\nusing different models and datasets. We discuss the possibilities and choices\ninvolved in the removal-based explanation process for visual data. The\nadaptation of six explanation methods for video is explained, with comparisons\nto existing approaches. We evaluate the performance of the methods using\nautomated metrics and user-based evaluation, showing that 3D RISE, 3D LIME, and\n3D Kernel SHAP outperform other methods. By decomposing the explanation process\ninto manageable steps, we facilitate the study of each choice's impact and\nallow for further refinement of explanation methods to suit specific datasets\nand models.\n","authors":["F. Xavier Gaya-Morey","Jose M. Buades-Rubio","Cristina Manresa-Yee"],"pdf_url":"https://arxiv.org/pdf/2401.11796v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12817v2","updated":"2024-01-22T09:44:18Z","published":"2023-10-19T15:12:44Z","title":"2D-3D Interlaced Transformer for Point Cloud Segmentation with\n Scene-Level Supervision","summary":" We present a Multimodal Interlaced Transformer (MIT) that jointly considers\n2D and 3D data for weakly supervised point cloud segmentation. Research studies\nhave shown that 2D and 3D features are complementary for point cloud\nsegmentation. However, existing methods require extra 2D annotations to achieve\n2D-3D information fusion. Considering the high annotation cost of point clouds,\neffective 2D and 3D feature fusion based on weakly supervised learning is in\ngreat demand. To this end, we propose a transformer model with two encoders and\none decoder for weakly supervised point cloud segmentation using only\nscene-level class tags. Specifically, the two encoders compute the\nself-attended features for 3D point clouds and 2D multi-view images,\nrespectively. The decoder implements interlaced 2D-3D cross-attention and\ncarries out implicit 2D and 3D feature fusion. We alternately switch the roles\nof queries and key-value pairs in the decoder layers. It turns out that the 2D\nand 3D features are iteratively enriched by each other. Experiments show that\nit performs favorably against existing weakly supervised point cloud\nsegmentation methods by a large margin on the S3DIS and ScanNet benchmarks. The\nproject page will be available at https://jimmy15923.github.io/mit_web/.\n","authors":["Cheng-Kun Yang","Min-Hung Chen","Yung-Yu Chuang","Yen-Yu Lin"],"pdf_url":"https://arxiv.org/pdf/2310.12817v2.pdf","comment":"ICCV 2023 (main + supp). 
Website:\n https://jimmy15923.github.io/mit_web/"},{"id":"http://arxiv.org/abs/2401.11791v1","updated":"2024-01-22T09:41:05Z","published":"2024-01-22T09:41:05Z","title":"SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic\n Segmentation","summary":" Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation\nmodels using training image data with only image-level supervision. Since\nprecise pixel-level annotations are not accessible, existing methods typically\nfocus on producing pseudo masks for training segmentation models by refining\nCAM-like heatmaps. However, the produced heatmaps may only capture\ndiscriminative image regions of target object categories or the associated\nco-occurring backgrounds. To address the issues, we propose a Semantic Prompt\nLearning for WSSS (SemPLeS) framework, which learns to effectively prompt the\nCLIP space to enhance the semantic alignment between the segmented regions and\nthe target object categories. More specifically, we propose Contrastive Prompt\nLearning and Class-associated Semantic Refinement to learn the prompts that\nadequately describe and suppress the image backgrounds associated with each\ntarget object category. In this way, our proposed framework is able to perform\nbetter semantic matching between object regions and the associated text labels,\nresulting in desired pseudo masks for training the segmentation model. The\nproposed SemPLeS framework achieves SOTA performance on the standard WSSS\nbenchmarks, PASCAL VOC and MS COCO, and demonstrated interpretability with the\nsemantic visualization of our learned prompts. The codes will be released.\n","authors":["Ci-Siang Lin","Chien-Yi Wang","Yu-Chiang Frank Wang","Min-Hung Chen"],"pdf_url":"https://arxiv.org/pdf/2401.11791v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11790v1","updated":"2024-01-22T09:40:52Z","published":"2024-01-22T09:40:52Z","title":"Deep Learning for Computer Vision based Activity Recognition and Fall\n Detection of the Elderly: a Systematic Review","summary":" As the percentage of elderly people in developed countries increases\nworldwide, the healthcare of this collective is a worrying matter, especially\nif it includes the preservation of their autonomy. In this direction, many\nstudies are being published on Ambient Assisted Living (AAL) systems, which\nhelp to reduce the preoccupations raised by the independent living of the\nelderly. In this study, a systematic review of the literature is presented on\nfall detection and Human Activity Recognition (HAR) for the elderly, as the two\nmain tasks to solve to guarantee the safety of elderly people living alone. To\naddress the current tendency to perform these two tasks, the review focuses on\nthe use of Deep Learning (DL) based approaches on computer vision data. In\naddition, different collections of data like DL models, datasets or hardware\n(e.g. depth or thermal cameras) are gathered from the reviewed studies and\nprovided for reference in future studies. Strengths and weaknesses of existing\napproaches are also discussed and, based on them, our recommendations for\nfuture works are provided.\n","authors":["F. Xavier Gaya-Morey","Cristina Manresa-Yee","Jose M. 
Buades-Rubio"],"pdf_url":"https://arxiv.org/pdf/2401.11790v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11783v1","updated":"2024-01-22T09:29:42Z","published":"2024-01-22T09:29:42Z","title":"Full-Body Motion Reconstruction with Sparse Sensing from Graph\n Perspective","summary":" Estimating 3D full-body pose from sparse sensor data is a pivotal technique\nemployed for the reconstruction of realistic human motions in Augmented Reality\nand Virtual Reality. However, translating sparse sensor signals into\ncomprehensive human motion remains a challenge since the sparsely distributed\nsensors in common VR systems fail to capture the motion of full human body. In\nthis paper, we use well-designed Body Pose Graph (BPG) to represent the human\nbody and translate the challenge into a prediction problem of graph missing\nnodes. Then, we propose a novel full-body motion reconstruction framework based\non BPG. To establish BPG, nodes are initially endowed with features extracted\nfrom sparse sensor signals. Features from identifiable joint nodes across\ndiverse sensors are amalgamated and processed from both temporal and spatial\nperspectives. Temporal dynamics are captured using the Temporal Pyramid\nStructure, while spatial relations in joint movements inform the spatial\nattributes. The resultant features serve as the foundational elements of the\nBPG nodes. To further refine the BPG, node features are updated through a graph\nneural network that incorporates edge reflecting varying joint relations. Our\nmethod's effectiveness is evidenced by the attained state-of-the-art\nperformance, particularly in lower body motion, outperforming other baseline\nmethods. Additionally, an ablation study validates the efficacy of each module\nin our proposed framework.\n","authors":["Feiyu Yao","Zongkai Wu","Li Yi"],"pdf_url":"https://arxiv.org/pdf/2401.11783v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11775v1","updated":"2024-01-22T09:11:12Z","published":"2024-01-22T09:11:12Z","title":"Collaborative Position Reasoning Network for Referring Image\n Segmentation","summary":" Given an image and a natural language expression as input, the goal of\nreferring image segmentation is to segment the foreground masks of the entities\nreferred by the expression. Existing methods mainly focus on interactive\nlearning between vision and language to enhance the multi-modal representations\nfor global context reasoning. However, predicting directly in pixel-level space\ncan lead to collapsed positioning and poor segmentation results. Its main\nchallenge lies in how to explicitly model entity localization, especially for\nnon-salient entities. In this paper, we tackle this problem by executing a\nCollaborative Position Reasoning Network (CPRN) via the proposed novel\nRow-and-Column interactive (RoCo) and Guided Holistic interactive (Holi)\nmodules. Specifically, RoCo aggregates the visual features into the row- and\ncolumn-wise features corresponding two directional axes respectively. It offers\na fine-grained matching behavior that perceives the associations between the\nlinguistic features and two decoupled visual features to perform position\nreasoning over a hierarchical space. 
Holi integrates features of the two\nmodalities by a cross-modal attention mechanism, which suppresses the\nirrelevant redundancy under the guide of positioning information from RoCo.\nThus, with the incorporation of RoCo and Holi modules, CPRN captures the visual\ndetails of position reasoning so that the model can achieve more accurate\nsegmentation. To our knowledge, this is the first work that explicitly focuses\non position reasoning modeling. We also validate the proposed method on three\nevaluation datasets. It consistently outperforms existing state-of-the-art\nmethods.\n","authors":["Jianjian Cao","Beiya Dai","Yulin Li","Xiameng Qin","Jingdong Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.17004v2","updated":"2024-01-22T09:10:04Z","published":"2023-12-28T13:16:03Z","title":"Continual Learning in Medical Image Analysis: A Comprehensive Review of\n Recent Advancements and Future Prospects","summary":" Medical imaging analysis has witnessed remarkable advancements even\nsurpassing human-level performance in recent years, driven by the rapid\ndevelopment of advanced deep-learning algorithms. However, when the inference\ndataset slightly differs from what the model has seen during one-time training,\nthe model performance is greatly compromised. The situation requires restarting\nthe training process using both the old and the new data which is\ncomputationally costly, does not align with the human learning process, and\nimposes storage constraints and privacy concerns. Alternatively, continual\nlearning has emerged as a crucial approach for developing unified and\nsustainable deep models to deal with new classes, tasks, and the drifting\nnature of data in non-stationary environments for various application areas.\nContinual learning techniques enable models to adapt and accumulate knowledge\nover time, which is essential for maintaining performance on evolving datasets\nand novel tasks. This systematic review paper provides a comprehensive overview\nof the state-of-the-art in continual learning techniques applied to medical\nimaging analysis. We present an extensive survey of existing research, covering\ntopics including catastrophic forgetting, data drifts, stability, and\nplasticity requirements. Further, an in-depth discussion of key components of a\ncontinual learning framework such as continual learning scenarios, techniques,\nevaluation schemes, and metrics is provided. Continual learning techniques\nencompass various categories, including rehearsal, regularization,\narchitectural, and hybrid strategies. We assess the popularity and\napplicability of continual learning categories in various medical sub-fields\nlike radiology and histopathology...\n","authors":["Pratibha Kumari","Joohi Chauhan","Afshin Bozorgpour","Boqiang Huang","Reza Azad","Dorit Merhof"],"pdf_url":"https://arxiv.org/pdf/2312.17004v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11767v1","updated":"2024-01-22T09:02:52Z","published":"2024-01-22T09:02:52Z","title":"Concealed Object Segmentation with Hierarchical Coherence Modeling","summary":" Concealed object segmentation (COS) is a challenging task that involves\nlocalizing and segmenting those concealed objects that are visually blended\nwith their surrounding environments. Despite achieving remarkable success,\nexisting COS segmenters still struggle to achieve complete segmentation results\nin extremely concealed scenarios. 
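As a rough sketch of the Holi-style step, the snippet below lets visual features attend to linguistic features with standard multi-head cross-attention and damps the output with a positional prior, which loosely stands in for the guidance RoCo would provide. The shapes, the sigmoid prior, and the module name are assumptions, not the CPRN design.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Visual queries attend to linguistic keys/values; an optional multiplicative
    bias stands in for positional guidance from a separate reasoning branch."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, lang, pos_bias=None):
        out, weights = self.attn(query=vis, key=lang, value=lang)
        if pos_bias is not None:            # suppress locations far from the predicted position
            out = out * pos_bias
        return self.norm(vis + out), weights

B, HW, L, D = 2, 49, 12, 64                 # 7x7 visual grid, 12 language tokens (assumed sizes)
vis = torch.randn(B, HW, D)
lang = torch.randn(B, L, D)
pos_bias = torch.sigmoid(torch.randn(B, HW, 1))   # stand-in positional prior per location

block = CrossModalAttention(D)
fused, attn = block(vis, lang, pos_bias)
print(fused.shape, attn.shape)              # torch.Size([2, 49, 64]) torch.Size([2, 49, 12])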
In this paper, we propose a Hierarchical\nCoherence Modeling (HCM) segmenter for COS, aiming to address this incomplete\nsegmentation limitation. In specific, HCM promotes feature coherence by\nleveraging the intra-stage coherence and cross-stage coherence modules,\nexploring feature correlations at both the single-stage and contextual levels.\nAdditionally, we introduce the reversible re-calibration decoder to detect\npreviously undetected parts in low-confidence regions, resulting in further\nenhancing segmentation performance. Extensive experiments conducted on three\nCOS tasks, including camouflaged object detection, polyp image segmentation,\nand transparent object detection, demonstrate the promising results achieved by\nthe proposed HCM segmenter.\n","authors":["Fengyang Xiao","Pan Zhang","Chunming He","Runze Hu","Yutao Liu"],"pdf_url":"https://arxiv.org/pdf/2401.11767v1.pdf","comment":"Accepted to CICAI 2023. 13 pages, 6 figures, 4 tables"},{"id":"http://arxiv.org/abs/2401.11751v1","updated":"2024-01-22T08:23:52Z","published":"2024-01-22T08:23:52Z","title":"Boosting Multi-view Stereo with Late Cost Aggregation","summary":" Pairwise matching cost aggregation is a crucial step for modern\nlearning-based Multi-view Stereo (MVS). Prior works adopt an early aggregation\nscheme, which adds up pairwise costs into an intermediate cost. However, we\nanalyze that this process can degrade informative pairwise matchings, thereby\nblocking the depth network from fully utilizing the original geometric matching\ncues.To address this challenge, we present a late aggregation approach that\nallows for aggregating pairwise costs throughout the network feed-forward\nprocess, achieving accurate estimations with only minor changes of the plain\nCasMVSNet.Instead of building an intermediate cost by weighted sum, late\naggregation preserves all pairwise costs along a distinct view channel. This\nenables the succeeding depth network to fully utilize the crucial geometric\ncues without loss of cost fidelity. Grounded in the new aggregation scheme, we\npropose further techniques addressing view order dependence inside the\npreserved cost, handling flexible testing views, and improving the depth\nfiltering process. Despite its technical simplicity, our method improves\nsignificantly upon the baseline cascade-based approach, achieving comparable\nresults with state-of-the-art methods with favorable computation overhead.\n","authors":["Jiang Wu","Rui Li","Yu Zhu","Wenxun Zhao","Jinqiu Sun","Yanning Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.11751v1.pdf","comment":"Code and models are available at https://github.com/Wuuu3511/LAMVSNET"},{"id":"http://arxiv.org/abs/2401.11740v1","updated":"2024-01-22T07:37:25Z","published":"2024-01-22T07:37:25Z","title":"Multi-level Cross-modal Alignment for Image Clustering","summary":" Recently, the cross-modal pretraining model has been employed to produce\nmeaningful pseudo-labels to supervise the training of an image clustering\nmodel. However, numerous erroneous alignments in a cross-modal pre-training\nmodel could produce poor-quality pseudo-labels and degrade clustering\nperformance. To solve the aforementioned issue, we propose a novel\n\\textbf{Multi-level Cross-modal Alignment} method to improve the alignments in\na cross-modal pretraining model for downstream tasks, by building a smaller but\nbetter semantic space and aligning the images and texts in three levels, i.e.,\ninstance-level, prototype-level, and semantic-level. 
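The contrast between early and late cost aggregation described above can be shown in a few lines: instead of collapsing pairwise matching costs into one intermediate volume, the late scheme stacks them along a distinct view channel for the depth network to fuse. The plain dot-product cost and tensor shapes below are simplifying assumptions, not the paper's exact cost construction.

import torch

def pairwise_costs(ref_feat, src_feats):
    """Dot-product matching cost between a reference feature volume and each
    warped source volume.  ref_feat: (B, C, D, H, W); src_feats: list of the same shape."""
    return [(ref_feat * src).mean(dim=1) for src in src_feats]   # each (B, D, H, W)

B, C, D, H, W = 1, 8, 32, 16, 16
ref = torch.randn(B, C, D, H, W)
srcs = [torch.randn(B, C, D, H, W) for _ in range(4)]            # 4 warped source views

costs = pairwise_costs(ref, srcs)

# Early aggregation: collapse views into one intermediate cost volume.
early = torch.stack(costs, dim=1).mean(dim=1)                    # (B, D, H, W)

# Late aggregation: preserve every pairwise cost along a distinct view channel
# and let the subsequent 3D depth network fuse them itself.
late = torch.stack(costs, dim=1)                                 # (B, V, D, H, W)

print(early.shape, late.shape)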
Theoretical results show\nthat our proposed method converges, and suggests effective means to reduce the\nexpected clustering risk of our method. Experimental results on five benchmark\ndatasets clearly show the superiority of our new method.\n","authors":["Liping Qiu","Qin Zhang","Xiaojun Chen","Shaotian Cai"],"pdf_url":"https://arxiv.org/pdf/2401.11740v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11739v1","updated":"2024-01-22T07:34:06Z","published":"2024-01-22T07:34:06Z","title":"EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models","summary":" Diffusion models have recently received increasing research attention for\ntheir remarkable transfer abilities in semantic segmentation tasks. However,\ngenerating fine-grained segmentation masks with diffusion models often requires\nadditional training on annotated datasets, leaving it unclear to what extent\npre-trained diffusion models alone understand the semantic relations of their\ngenerated images. To address this question, we leverage the semantic knowledge\nextracted from Stable Diffusion (SD) and aim to develop an image segmentor\ncapable of generating fine-grained segmentation maps without any additional\ntraining. The primary difficulty stems from the fact that semantically\nmeaningful feature maps typically exist only in the spatially lower-dimensional\nlayers, which poses a challenge in directly extracting pixel-level semantic\nrelations from these feature maps. To overcome this issue, our framework\nidentifies semantic correspondences between image pixels and spatial locations\nof low-dimensional feature maps by exploiting SD's generation process and\nutilizes them for constructing image-resolution segmentation maps. In extensive\nexperiments, the produced segmentation maps are demonstrated to be well\ndelineated and capture detailed parts of the images, indicating the existence\nof highly accurate pixel-level semantic knowledge in diffusion models.\n","authors":["Koichi Namekata","Amirmojtaba Sabour","Sanja Fidler","Seung Wook Kim"],"pdf_url":"https://arxiv.org/pdf/2401.11739v1.pdf","comment":"ICLR 2024. Project page: https://kmcode1.github.io/Projects/EmerDiff/"},{"id":"http://arxiv.org/abs/2401.11738v1","updated":"2024-01-22T07:31:52Z","published":"2024-01-22T07:31:52Z","title":"MetaSeg: Content-Aware Meta-Net for Omni-Supervised Semantic\n Segmentation","summary":" Noisy labels, inevitably existing in pseudo segmentation labels generated\nfrom weak object-level annotations, severely hampers model optimization for\nsemantic segmentation. Previous works often rely on massive hand-crafted losses\nand carefully-tuned hyper-parameters to resist noise, suffering poor\ngeneralization capability and high model complexity. Inspired by recent\nadvances in meta learning, we argue that rather than struggling to tolerate\nnoise hidden behind clean labels passively, a more feasible solution would be\nto find out the noisy regions actively, so as to simply ignore them during\nmodel optimization. With this in mind, this work presents a novel meta learning\nbased semantic segmentation method, MetaSeg, that comprises a primary\ncontent-aware meta-net (CAM-Net) to sever as a noise indicator for an arbitrary\nsegmentation model counterpart. 
Specifically, CAM-Net learns to generate\npixel-wise weights to suppress noisy regions with incorrect pseudo labels while\nhighlighting clean ones by exploiting hybrid strengthened features from image\ncontent, providing straightforward and reliable guidance for optimizing the\nsegmentation model. Moreover, to break the barrier of time-consuming training\nwhen applying meta learning to common large segmentation models, we further\npresent a new decoupled training strategy that optimizes different model layers\nin a divide-and-conquer manner. Extensive experiments on object, medical,\nremote sensing and human segmentation shows that our method achieves superior\nperformance, approaching that of fully supervised settings, which paves a new\npromising way for omni-supervised semantic segmentation.\n","authors":["Shenwang Jiang","Jianan Li","Ying Wang","Wenxuan Wu","Jizhou Zhang","Bo Huang","Tingfa Xu"],"pdf_url":"https://arxiv.org/pdf/2401.11738v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.16244v2","updated":"2024-01-22T07:29:09Z","published":"2023-12-25T11:39:00Z","title":"Modality-missing RGBT Tracking via Invertible Prompt Learning and A\n High-quality Data Simulation Method","summary":" Current RGBT tracking researches mainly focus on the modality-complete\nscenarios, overlooking the modality-missing challenge in real-world scenes. In\nthis work, we comprehensively investigate the impact of modality-missing\nchallenge in RGBT tracking and propose a novel invertible prompt learning\napproach, which integrates the content-preserving prompts into a well-trained\ntracking model to adapt to various modality-missing scenarios, for\nmodality-missing RGBT tracking. In particular, given one modality-missing\nscenario, we propose to utilize the available modality to generate the prompt\nof the missing modality to adapt to RGBT tracking model. However, the\ncross-modality gap between available and missing modalities usually causes\nsemantic distortion and information loss in prompt generation. To handle this\nissue, we propose the invertible prompt learning scheme by incorporating the\nfull reconstruction of the input available modality from the prompt in prompt\ngeneration model. Considering that there lacks a modality-missing RGBT tracking\ndataset and many modality-missing scenarios are difficult to capture, we design\na high-quality data simulation method based on hierarchical combination schemes\nto generate real-world modality-missing data. Extensive experiments on three\nmodality-missing datasets show that our method achieves significant performance\nimprovements compared with state-of-the-art methods. We will release the code\nand simulation dataset.\n","authors":["Andong Lu","Jiacong Zhao","Chenglong Li","Jin Tang","Bin Luo"],"pdf_url":"https://arxiv.org/pdf/2312.16244v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.14571v4","updated":"2024-01-22T07:24:58Z","published":"2022-10-26T09:01:19Z","title":"Towards the Detection of Diffusion Model Deepfakes","summary":" In the course of the past few years, diffusion models (DMs) have reached an\nunprecedented level of visual quality. However, relatively little attention has\nbeen paid to the detection of DM-generated images, which is critical to prevent\nadverse impacts on our society. In contrast, generative adversarial networks\n(GANs), have been extensively studied from a forensic perspective. 
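The core of the CAM-Net idea, per-pixel weights that suppress noisy pseudo-labels in the segmentation loss, can be sketched as below. The one-layer stand-in "meta-net", its input, and the weighted cross-entropy are illustrative simplifications; the actual CAM-Net, its hybrid features, and the decoupled meta-learning loop are more involved.

import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, B, H, W = 5, 2, 32, 32
logits = torch.randn(B, num_classes, H, W, requires_grad=True)   # segmentation model output
pseudo = torch.randint(0, num_classes, (B, H, W))                # noisy pseudo-labels

# A tiny stand-in for CAM-Net: predicts a weight in [0, 1] per pixel
# (here from the logits themselves, for brevity).
meta_net = nn.Sequential(nn.Conv2d(num_classes, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 1), nn.Sigmoid())
weights = meta_net(logits).squeeze(1)                            # (B, H, W), ~0 means "ignore this pixel"

per_pixel_ce = F.cross_entropy(logits, pseudo, reduction="none") # (B, H, W)
loss = (weights * per_pixel_ce).sum() / weights.sum().clamp(min=1e-6)
loss.backward()                                                  # trains both nets in this toy setup
print(float(loss))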
In this\nwork, we therefore take the natural next step to evaluate whether previous\nmethods can be used to detect images generated by DMs. Our experiments yield\ntwo key findings: (1) state-of-the-art GAN detectors are unable to reliably\ndistinguish real from DM-generated images, but (2) re-training them on\nDM-generated images allows for almost perfect detection, which remarkably even\ngeneralizes to GANs. Together with a feature space analysis, our results lead\nto the hypothesis that DMs produce fewer detectable artifacts and are thus more\ndifficult to detect compared to GANs. One possible reason for this is the\nabsence of grid-like frequency artifacts in DM-generated images, which are a\nknown weakness of GANs. However, we make the interesting observation that\ndiffusion models tend to underestimate high frequencies, which we attribute to\nthe learning objective.\n","authors":["Jonas Ricker","Simon Damm","Thorsten Holz","Asja Fischer"],"pdf_url":"https://arxiv.org/pdf/2210.14571v4.pdf","comment":"Accepted at VISAPP 2024. This is the extended version with additional\n experiments and supplemental material. Code and data:\n https://github.com/jonasricker/diffusion-model-deepfake-detection"},{"id":"http://arxiv.org/abs/2401.11734v1","updated":"2024-01-22T07:23:44Z","published":"2024-01-22T07:23:44Z","title":"Colorectal Polyp Segmentation in the Deep Learning Era: A Comprehensive\n Survey","summary":" Colorectal polyp segmentation (CPS), an essential problem in medical image\nanalysis, has garnered growing research attention. Recently, the deep\nlearning-based model completely overwhelmed traditional methods in the field of\nCPS, and more and more deep CPS methods have emerged, bringing the CPS into the\ndeep learning era. To help the researchers quickly grasp the main techniques,\ndatasets, evaluation metrics, challenges, and trending of deep CPS, this paper\npresents a systematic and comprehensive review of deep-learning-based CPS\nmethods from 2014 to 2023, a total of 115 technical papers. In particular, we\nfirst provide a comprehensive review of the current deep CPS with a novel\ntaxonomy, including network architectures, level of supervision, and learning\nparadigm. More specifically, network architectures include eight subcategories,\nthe level of supervision comprises six subcategories, and the learning paradigm\nencompasses 12 subcategories, totaling 26 subcategories. Then, we provided a\ncomprehensive analysis the characteristics of each dataset, including the\nnumber of datasets, annotation types, image resolution, polyp size, contrast\nvalues, and polyp location. Following that, we summarized CPS's commonly used\nevaluation metrics and conducted a detailed analysis of 40 deep SOTA models,\nincluding out-of-distribution generalization and attribute-based performance\nanalysis. Finally, we discussed deep learning-based CPS methods' main\nchallenges and opportunities.\n","authors":["Zhenyu Wu","Fengmao Lv","Chenglizhao Chen","Aimin Hao","Shuo Li"],"pdf_url":"https://arxiv.org/pdf/2401.11734v1.pdf","comment":"21 pages, 8 figures"},{"id":"http://arxiv.org/abs/2309.02773v3","updated":"2024-01-22T07:18:55Z","published":"2023-09-06T06:31:08Z","title":"Diffusion Model is Secretly a Training-free Open Vocabulary Semantic\n Segmenter","summary":" The pre-trained text-image discriminative models, such as CLIP, has been\nexplored for open-vocabulary semantic segmentation with unsatisfactory results\ndue to the loss of crucial localization information and awareness of object\nshapes. 
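The high-frequency observation above is commonly probed with a radially averaged power spectrum; a minimal numpy sketch follows, where a random array and a crude low-pass filter stand in for a real and a diffusion-generated image, and the chosen frequency band is arbitrary.

import numpy as np

def radial_power_spectrum(img):
    """Azimuthally averaged power spectrum of a grayscale image (H, W)."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2).astype(int)
    # Mean power within each integer radius bin.
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

rng = np.random.default_rng(0)
real = rng.random((128, 128))
fake = 0.5 * (np.roll(real, 1, 0) + real)          # crude low-pass surrogate for a generated image

spec_real = radial_power_spectrum(real)
spec_fake = radial_power_spectrum(fake)
high = slice(40, 64)                               # a high-frequency band (arbitrary choice)
print(spec_fake[high].mean() < spec_real[high].mean())   # True: less high-frequency energy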
Recently, there has been a growing interest in expanding the\napplication of generative models from generation tasks to semantic\nsegmentation. These approaches utilize generative models either for generating\nannotated data or extracting features to facilitate semantic segmentation. This\ntypically involves generating a considerable amount of synthetic data or\nrequiring additional mask annotations. To this end, we uncover the potential of\ngenerative text-to-image diffusion models (e.g., Stable Diffusion) as highly\nefficient open-vocabulary semantic segmenters, and introduce a novel\ntraining-free approach named DiffSegmenter. The insight is that to generate\nrealistic objects that are semantically faithful to the input text, both the\ncomplete object shapes and the corresponding semantics are implicitly learned\nby diffusion models. We discover that the object shapes are characterized by\nthe self-attention maps while the semantics are indicated through the\ncross-attention maps produced by the denoising U-Net, forming the basis of our\nsegmentation results.Additionally, we carefully design effective textual\nprompts and a category filtering mechanism to further enhance the segmentation\nresults. Extensive experiments on three benchmark datasets show that the\nproposed DiffSegmenter achieves impressive results for open-vocabulary semantic\nsegmentation.\n","authors":["Jinglong Wang","Xiawei Li","Jing Zhang","Qingyuan Xu","Qin Zhou","Qian Yu","Lu Sheng","Dong Xu"],"pdf_url":"https://arxiv.org/pdf/2309.02773v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11726v1","updated":"2024-01-22T07:07:32Z","published":"2024-01-22T07:07:32Z","title":"Detecting Out-of-Distribution Samples via Conditional Distribution\n Entropy with Optimal Transport","summary":" When deploying a trained machine learning model in the real world, it is\ninevitable to receive inputs from out-of-distribution (OOD) sources. For\ninstance, in continual learning settings, it is common to encounter OOD samples\ndue to the non-stationarity of a domain. More generally, when we have access to\na set of test inputs, the existing rich line of OOD detection solutions,\nespecially the recent promise of distance-based methods, falls short in\neffectively utilizing the distribution information from training samples and\ntest inputs. In this paper, we argue that empirical probability distributions\nthat incorporate geometric information from both training samples and test\ninputs can be highly beneficial for OOD detection in the presence of test\ninputs available. To address this, we propose to model OOD detection as a\ndiscrete optimal transport problem. Within the framework of optimal transport,\nwe propose a novel score function known as the \\emph{conditional distribution\nentropy} to quantify the uncertainty of a test input being an OOD sample. Our\nproposal inherits the merits of certain distance-based methods while\neliminating the reliance on distribution assumptions, a-prior knowledge, and\nspecific training mechanisms. Extensive experiments conducted on benchmark\ndatasets demonstrate that our method outperforms its competitors in OOD\ndetection.\n","authors":["Chuanwen Feng","Wenlong Chen","Ao Ke","Yilong Ren","Xike Xie","S. 
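A minimal sketch of the recipe DiffSegmenter builds on: take a token's cross-attention map as its semantics and propagate it through powers of the self-attention map, which encodes object shape, then threshold. The random tensors below stand in for maps extracted from the denoising U-Net, and the propagation power and threshold are arbitrary choices rather than the paper's settings.

import torch

def attention_to_mask(cross_attn, self_attn, token_idx, power=4, thresh=0.5):
    """cross_attn: (HW, T) attention from image locations to text tokens;
    self_attn: (HW, HW) attention among image locations.
    Propagate the token's cross-attention through (self-attention)^power,
    then normalise and threshold to obtain a binary mask."""
    relevance = cross_attn[:, token_idx]                          # (HW,)
    propagation = torch.linalg.matrix_power(self_attn, power)     # shape-aware smoothing
    refined = propagation @ relevance
    refined = (refined - refined.min()) / (refined.max() - refined.min() + 1e-8)
    return (refined >= thresh).float()

HW, T = 16 * 16, 8                                 # 16x16 latent grid, 8 text tokens (assumed sizes)
cross = torch.rand(HW, T).softmax(dim=-1)          # stand-in cross-attention maps
self_attn = torch.rand(HW, HW).softmax(dim=-1)     # stand-in self-attention, rows sum to one

mask = attention_to_mask(cross, self_attn, token_idx=3).reshape(16, 16)
print(mask.shape, mask.unique())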
Kevin Zhou"],"pdf_url":"https://arxiv.org/pdf/2401.11726v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11724v1","updated":"2024-01-22T06:56:52Z","published":"2024-01-22T06:56:52Z","title":"Augmenting Prototype Network with TransMix for Few-shot Hyperspectral\n Image Classification","summary":" Few-shot hyperspectral image classification aims to identify the classes of\neach pixel in the images by only marking few of these pixels. And in order to\nobtain the spatial-spectral joint features of each pixel, the fixed-size\npatches centering around each pixel are often used for classification. However,\nobserving the classification results of existing methods, we found that\nboundary patches corresponding to the pixels which are located at the boundary\nof the objects in the hyperspectral images, are hard to classify. These\nboundary patchs are mixed with multi-class spectral information. Inspired by\nthis, we propose to augment the prototype network with TransMix for few-shot\nhyperspectrial image classification(APNT). While taking the prototype network\nas the backbone, it adopts the transformer as feature extractor to learn the\npixel-to-pixel relation and pay different attentions to different pixels. At\nthe same time, instead of directly using the patches which are cut from the\nhyperspectral images for training, it randomly mixs up two patches to imitate\nthe boundary patches and uses the synthetic patches to train the model, with\nthe aim to enlarge the number of hard training samples and enhance their\ndiversity. And by following the data agumentation technique TransMix, the\nattention returned by the transformer is also used to mix up the labels of two\npatches to generate better labels for synthetic patches. Compared with existing\nmethods, the proposed method has demonstrated sate of the art performance and\nbetter robustness for few-shot hyperspectral image classification in our\nexperiments.\n","authors":["Chun Liu","Longwei Yang","Dongmei Dong","Zheng Li","Wei Yang","Zhigang Han","Jiayao Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11724v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.17240v3","updated":"2024-01-22T06:53:23Z","published":"2023-12-28T18:58:33Z","title":"LISA++: An Improved Baseline for Reasoning Segmentation with Large\n Language Model","summary":" While LISA effectively bridges the gap between segmentation and large\nlanguage models to enable reasoning segmentation, it poses certain limitations:\nunable to distinguish different instances of the target region, and constrained\nby the pre-defined textual response formats. In this work, we introduce LISA++,\nan update to the existing LISA model, focusing on improving core\nfunctionalities while keeping the base architecture intact. The main\nenhancements in LISA++ include: \\textbf{1) Enhanced Segmentation}: The instance\nsegmentation ability has been added, providing a more detailed scene analysis\nalong with the existing multi-region semantic segmentation. \\textbf{2) More\nNatural Conversation}: Improved capability for multi-turn dialogue, with the\nability to incorporate segmentation results directly into text responses, i.e.,\nSegmentation in Dialogue (SiD). These improvements are achieved by curating the\nexisting samples of generic segmentation datasets, aimed specifically at\nenhancing the segmentation and conversational skills without structural change\nand additional data sources. 
Comparative analysis with the original LISA model\nshows significant advancements in these areas, positioning LISA++ as a notable\nupgrade in visual understanding and interaction. LISA++'s adaptability and\nimproved features highlight the versatility of the mask-as-embedding paradigm\nproposed by LISA, and the potential as a foundational model for diverse\napplications.\n","authors":["Senqiao Yang","Tianyuan Qu","Xin Lai","Zhuotao Tian","Bohao Peng","Shu Liu","Jiaya Jia"],"pdf_url":"https://arxiv.org/pdf/2312.17240v3.pdf","comment":"Typo fixed"},{"id":"http://arxiv.org/abs/2211.08824v4","updated":"2024-01-22T06:46:27Z","published":"2022-11-16T10:49:48Z","title":"SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object\n Tracking","summary":" Despite recent progress in Multiple Object Tracking (MOT), several obstacles\nsuch as occlusions, similar objects, and complex scenes remain an open\nchallenge. Meanwhile, a systematic study of the cost-performance tradeoff for\nthe popular tracking-by-detection paradigm is still lacking. This paper\nintroduces SMILEtrack, an innovative object tracker that effectively addresses\nthese challenges by integrating an efficient object detector with a Siamese\nnetwork-based Similarity Learning Module (SLM). The technical contributions of\nSMILETrack are twofold. First, we propose an SLM that calculates the appearance\nsimilarity between two objects, overcoming the limitations of feature\ndescriptors in Separate Detection and Embedding (SDE) models. The SLM\nincorporates a Patch Self-Attention (PSA) block inspired by the vision\nTransformer, which generates reliable features for accurate similarity\nmatching. Second, we develop a Similarity Matching Cascade (SMC) module with a\nnovel GATE function for robust object matching across consecutive video frames,\nfurther enhancing MOT performance. Together, these innovations help SMILETrack\nachieve an improved trade-off between the cost ({\\em e.g.}, running speed) and\nperformance (e.g., tracking accuracy) over several existing state-of-the-art\nbenchmarks, including the popular BYTETrack method. SMILETrack outperforms\nBYTETrack by 0.4-0.8 MOTA and 2.1-2.2 HOTA points on MOT17 and MOT20 datasets.\nCode is available at https://github.com/pingyang1117/SMILEtrack_Official\n","authors":["Yu-Hsiang Wang","Jun-Wei Hsieh","Ping-Yang Chen","Ming-Ching Chang","Hung Hin So","Xin Li"],"pdf_url":"https://arxiv.org/pdf/2211.08824v4.pdf","comment":"Our paper was accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2401.11719v1","updated":"2024-01-22T06:43:13Z","published":"2024-01-22T06:43:13Z","title":"SFC: Shared Feature Calibration in Weakly Supervised Semantic\n Segmentation","summary":" Image-level weakly supervised semantic segmentation has received increasing\nattention due to its low annotation cost. Existing methods mainly rely on Class\nActivation Mapping (CAM) to obtain pseudo-labels for training semantic\nsegmentation models. In this work, we are the first to demonstrate that\nlong-tailed distribution in training data can cause the CAM calculated through\nclassifier weights over-activated for head classes and under-activated for tail\nclasses due to the shared features among head- and tail- classes. This degrades\npseudo-label quality and further influences final semantic segmentation\nperformance. To address this issue, we propose a Shared Feature Calibration\n(SFC) method for CAM generation. 
Specifically, we leverage the class prototypes\nthat carry positive shared features and propose a Multi-Scaled\nDistribution-Weighted (MSDW) consistency loss for narrowing the gap between the\nCAMs generated through classifier weights and class prototypes during training.\nThe MSDW loss counterbalances over-activation and under-activation by\ncalibrating the shared features in head-/tail-class classifier weights.\nExperimental results show that our SFC significantly improves CAM boundaries\nand achieves new state-of-the-art performances. The project is available at\nhttps://github.com/Barrett-python/SFC.\n","authors":["Xinqiao Zhao","Feilong Tang","Xiaoyang Wang","Jimin Xiao"],"pdf_url":"https://arxiv.org/pdf/2401.11719v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11718v1","updated":"2024-01-22T06:42:23Z","published":"2024-01-22T06:42:23Z","title":"MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D\n Object Detection","summary":" Accurate 3D object detection in large-scale outdoor scenes, characterized by\nconsiderable variations in object scales, necessitates features rich in both\nlong-range and fine-grained information. While recent detectors have utilized\nwindow-based transformers to model long-range dependencies, they tend to\noverlook fine-grained details. To bridge this gap, we propose MsSVT++, an\ninnovative Mixed-scale Sparse Voxel Transformer that simultaneously captures\nboth types of information through a divide-and-conquer approach. This approach\ninvolves explicitly dividing attention heads into multiple groups, each\nresponsible for attending to information within a specific range. The outputs\nof these groups are subsequently merged to obtain final mixed-scale features.\nTo mitigate the computational complexity associated with applying a\nwindow-based transformer in 3D voxel space, we introduce a novel Chessboard\nSampling strategy and implement voxel sampling and gathering operations\nsparsely using a hash map. Moreover, an important challenge stems from the\nobservation that non-empty voxels are primarily located on the surface of\nobjects, which impedes the accurate estimation of bounding boxes. To overcome\nthis challenge, we introduce a Center Voting module that integrates newly voted\nvoxels enriched with mixed-scale contextual information towards the centers of\nthe objects, thereby improving precise object localization. Extensive\nexperiments demonstrate that our single-stage detector, built upon the\nfoundation of MsSVT++, consistently delivers exceptional performance across\ndiverse datasets.\n","authors":["Jianan Li","Shaocong Dong","Lihe Ding","Tingfa Xu"],"pdf_url":"https://arxiv.org/pdf/2401.11718v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.03842v4","updated":"2024-01-22T06:30:15Z","published":"2022-04-08T05:11:04Z","title":"From 2D Images to 3D Model:Weakly Supervised Multi-View Face\n Reconstruction with Deep Fusion","summary":" While weakly supervised multi-view face reconstruction (MVR) is garnering\nincreased attention, one critical issue still remains open: how to effectively\nfuse multiple image information to reconstruct high-precision 3D models. In\nthis regard, we propose a novel model called Deep Fusion MVR (DF-MVR) to\nreconstruct high-precision 3D facial shapes from multi-view images.\nSpecifically, we introduce MulEn-Unet, a multi-view encoding to single decoding\nframework with skip connections and attention. 
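A simplified sketch of the two CAM sources being aligned: one CAM from the classifier weights and one from class prototypes, with a plain L1 consistency between them. This single-scale, unweighted loss is only a stand-in for the Multi-Scaled Distribution-Weighted (MSDW) loss; the shapes, random prototypes, and max-normalisation are assumptions.

import torch
import torch.nn.functional as F

B, C, H, W, K = 2, 64, 14, 14, 5
feats = torch.randn(B, C, H, W, requires_grad=True)    # backbone feature maps
classifier_w = torch.randn(K, C)                        # weights of the classification head
prototypes = torch.randn(K, C)                          # e.g. running means of class features

def cam(feats, weight):
    """Class activation maps: per-class dot product between weights and features."""
    maps = torch.einsum("kc,bchw->bkhw", weight, feats)
    return F.relu(maps) / maps.flatten(2).amax(dim=2).clamp(min=1e-6)[..., None, None]

cam_w = cam(feats, classifier_w)                        # classifier-weight CAM
cam_p = cam(feats, prototypes)                          # prototype CAM
consistency = F.l1_loss(cam_w, cam_p)                   # narrow the gap between the two CAMs
consistency.backward()
print(cam_w.shape, float(consistency))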
This design allows for the\nextraction, integration, and compensation of deep features with attention from\nmulti-view images. Furthermore, we adopt the involution kernel to enrich deep\nfusion features with channel features. In addition, we develop the face parse\nnetwork to learn, identify, and emphasize the critical common face area within\nmulti-view images. Experiments on Pixel-Face and Bosphorus datasets indicate\nthe superiority of our model. Without 3D annotation, DF-MVR achieves 5.2% and\n3.0% RMSE improvement over the existing weakly supervised MVRs respectively on\nPixel-Face and Bosphorus dataset. Code will be available publicly at\nhttps://github.com/weiguangzhao/DF_MVR.\n","authors":["Weiguang Zhao","Chaolong Yang","Jianan Ye","Rui Zhang","Yuyao Yan","Xi Yang","Bin Dong","Amir Hussain","Kaizhu Huang"],"pdf_url":"https://arxiv.org/pdf/2204.03842v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11713v1","updated":"2024-01-22T06:29:52Z","published":"2024-01-22T06:29:52Z","title":"Medical Image Debiasing by Learning Adaptive Agreement from a Biased\n Council","summary":" Deep learning could be prone to learning shortcuts raised by dataset bias and\nresult in inaccurate, unreliable, and unfair models, which impedes its adoption\nin real-world clinical applications. Despite its significance, there is a\ndearth of research in the medical image classification domain to address\ndataset bias. Furthermore, the bias labels are often agnostic, as identifying\nbiases can be laborious and depend on post-hoc interpretation. This paper\nproposes learning Adaptive Agreement from a Biased Council (Ada-ABC), a\ndebiasing framework that does not rely on explicit bias labels to tackle\ndataset bias in medical images. Ada-ABC develops a biased council consisting of\nmultiple classifiers optimized with generalized cross entropy loss to learn the\ndataset bias. A debiasing model is then simultaneously trained under the\nguidance of the biased council. Specifically, the debiasing model is required\nto learn adaptive agreement with the biased council by agreeing on the\ncorrectly predicted samples and disagreeing on the wrongly predicted samples by\nthe biased council. In this way, the debiasing model could learn the target\nattribute on the samples without spurious correlations while also avoiding\nignoring the rich information in samples with spurious correlations. We\ntheoretically demonstrated that the debiasing model could learn the target\nfeatures when the biased model successfully captures dataset bias. Moreover, to\nour best knowledge, we constructed the first medical debiasing benchmark from\nfour datasets containing seven different bias scenarios. Our extensive\nexperiments practically showed that our proposed Ada-ABC outperformed\ncompetitive approaches, verifying its effectiveness in mitigating dataset bias\nfor medical image classification. The codes and organized benchmark datasets\nwill be made publicly available.\n","authors":["Luyang Luo","Xin Huang","Minghao Wang","Zhuoyue Wan","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2401.11713v1.pdf","comment":"10 pages, 5 figures, 3 tables. 
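Two ingredients of the Ada-ABC recipe can be sketched directly: the generalized cross-entropy (GCE) loss that lets the biased council latch onto easy, bias-aligned samples, and a reweighting that emphasizes the samples the council gets wrong when training the debiasing model. The reweighting below is a crude stand-in for the paper's adaptive agreement objective, and all logits are random placeholders.

import torch
import torch.nn.functional as F

def gce_loss(logits, target, q=0.7):
    """Generalized cross-entropy: (1 - p_y^q) / q, interpolating between CE (q -> 0) and MAE (q = 1)."""
    p_y = F.softmax(logits, dim=1).gather(1, target[:, None]).squeeze(1)
    return ((1.0 - p_y.clamp(min=1e-6) ** q) / q).mean()

B, K = 8, 3
x_biased = torch.randn(B, K, requires_grad=True)        # biased-council logits (stand-in)
x_debias = torch.randn(B, K, requires_grad=True)        # debiasing-model logits (stand-in)
y = torch.randint(0, K, (B,))

council_loss = gce_loss(x_biased, y)                    # the council happily fits the bias

# Debiasing model: emphasise samples the council mispredicts (likely bias-conflicting ones),
# a simplified stand-in for the paper's adaptive agreement objective.
wrong = (x_biased.argmax(dim=1) != y).float().detach()
per_sample_ce = F.cross_entropy(x_debias, y, reduction="none")
debias_loss = ((1.0 + 2.0 * wrong) * per_sample_ce).mean()

(council_loss + debias_loss).backward()
print(float(council_loss), float(debias_loss))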
Code and benchmark will be released\n via https://github.com/LLYXC/Ada-ABC/tree/main"},{"id":"http://arxiv.org/abs/2401.11711v1","updated":"2024-01-22T06:28:08Z","published":"2024-01-22T06:28:08Z","title":"HG3-NeRF: Hierarchical Geometric, Semantic, and Photometric Guided\n Neural Radiance Fields for Sparse View Inputs","summary":" Neural Radiance Fields (NeRF) have garnered considerable attention as a\nparadigm for novel view synthesis by learning scene representations from\ndiscrete observations. Nevertheless, NeRF exhibit pronounced performance\ndegradation when confronted with sparse view inputs, consequently curtailing\nits further applicability. In this work, we introduce Hierarchical Geometric,\nSemantic, and Photometric Guided NeRF (HG3-NeRF), a novel methodology that can\naddress the aforementioned limitation and enhance consistency of geometry,\nsemantic content, and appearance across different views. We propose\nHierarchical Geometric Guidance (HGG) to incorporate the attachment of\nStructure from Motion (SfM), namely sparse depth prior, into the scene\nrepresentations. Different from direct depth supervision, HGG samples volume\npoints from local-to-global geometric regions, mitigating the misalignment\ncaused by inherent bias in the depth prior. Furthermore, we draw inspiration\nfrom notable variations in semantic consistency observed across images of\ndifferent resolutions and propose Hierarchical Semantic Guidance (HSG) to learn\nthe coarse-to-fine semantic content, which corresponds to the coarse-to-fine\nscene representations. Experimental results demonstrate that HG3-NeRF can\noutperform other state-of-the-art methods on different standard benchmarks and\nachieve high-fidelity synthesis results for sparse view inputs.\n","authors":["Zelin Gao","Weichen Dai","Yu Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.11711v1.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2401.11708v1","updated":"2024-01-22T06:16:29Z","published":"2024-01-22T06:16:29Z","title":"Mastering Text-to-Image Diffusion: Recaptioning, Planning, and\n Generating with Multimodal LLMs","summary":" Diffusion models have exhibit exceptional performance in text-to-image\ngeneration and editing. However, existing methods often face challenges when\nhandling complex text prompts that involve multiple objects with multiple\nattributes and relationships. In this paper, we propose a brand new\ntraining-free text-to-image generation/editing framework, namely Recaption,\nPlan and Generate (RPG), harnessing the powerful chain-of-thought reasoning\nability of multimodal LLMs to enhance the compositionality of text-to-image\ndiffusion models. Our approach employs the MLLM as a global planner to\ndecompose the process of generating complex images into multiple simpler\ngeneration tasks within subregions. We propose complementary regional diffusion\nto enable region-wise compositional generation. Furthermore, we integrate\ntext-guided image generation and editing within the proposed RPG in a\nclosed-loop fashion, thereby enhancing generalization ability. Extensive\nexperiments demonstrate our RPG outperforms state-of-the-art text-to-image\ndiffusion models, including DALL-E 3 and SDXL, particularly in multi-category\nobject composition and text-image semantic alignment. Notably, our RPG\nframework exhibits wide compatibility with various MLLM architectures (e.g.,\nMiniGPT-4) and diffusion backbones (e.g., ControlNet). 
Our code is available\nat: https://github.com/YangLing0818/RPG-DiffusionMaster\n","authors":["Ling Yang","Zhaochen Yu","Chenlin Meng","Minkai Xu","Stefano Ermon","Bin Cui"],"pdf_url":"https://arxiv.org/pdf/2401.11708v1.pdf","comment":"Project: https://github.com/YangLing0818/RPG-DiffusionMaster"},{"id":"http://arxiv.org/abs/2401.11704v1","updated":"2024-01-22T06:05:26Z","published":"2024-01-22T06:05:26Z","title":"EK-Net:Real-time Scene Text Detection with Expand Kernel Distance","summary":" Recently, scene text detection has received significant attention due to its\nwide application. However, accurate detection in complex scenes of multiple\nscales, orientations, and curvature remains a challenge. Numerous detection\nmethods adopt the Vatti clipping (VC) algorithm for multiple-instance training\nto address the issue of arbitrary-shaped text. Yet we identify several bias\nresults from these approaches called the \"shrinked kernel\". Specifically, it\nrefers to a decrease in accuracy resulting from an output that overly favors\nthe text kernel. In this paper, we propose a new approach named Expand Kernel\nNetwork (EK-Net) with expand kernel distance to compensate for the previous\ndeficiency, which includes three-stages regression to complete instance\ndetection. Moreover, EK-Net not only realize the precise positioning of\narbitrary-shaped text, but also achieve a trade-off between performance and\nspeed. Evaluation results demonstrate that EK-Net achieves state-of-the-art or\ncompetitive performance compared to other advanced methods, e.g., F-measure of\n85.72% at 35.42 FPS on ICDAR 2015, F-measure of 85.75% at 40.13 FPS on CTW1500.\n","authors":["Boyuan Zhu","Fagui Liu","Xi Chen","Quan Tang"],"pdf_url":"https://arxiv.org/pdf/2401.11704v1.pdf","comment":"2024 IEEE International Conference on Acoustics, Speech and Signal\n Processing"},{"id":"http://arxiv.org/abs/2304.03047v3","updated":"2024-01-22T04:57:32Z","published":"2023-04-06T13:07:17Z","title":"ETPNav: Evolving Topological Planning for Vision-Language Navigation in\n Continuous Environments","summary":" Vision-language navigation is a task that requires an agent to follow\ninstructions to navigate in environments. It becomes increasingly crucial in\nthe field of embodied AI, with potential applications in autonomous navigation,\nsearch and rescue, and human-robot interaction. In this paper, we propose to\naddress a more practical yet challenging counterpart setting - vision-language\nnavigation in continuous environments (VLN-CE). To develop a robust VLN-CE\nagent, we propose a new navigation framework, ETPNav, which focuses on two\ncritical skills: 1) the capability to abstract environments and generate\nlong-range navigation plans, and 2) the ability of obstacle-avoiding control in\ncontinuous environments. ETPNav performs online topological mapping of\nenvironments by self-organizing predicted waypoints along a traversed path,\nwithout prior environmental experience. It privileges the agent to break down\nthe navigation procedure into high-level planning and low-level control.\nConcurrently, ETPNav utilizes a transformer-based cross-modal planner to\ngenerate navigation plans based on topological maps and instructions. The plan\nis then performed through an obstacle-avoiding controller that leverages a\ntrial-and-error heuristic to prevent navigation from getting stuck in\nobstacles. Experimental results demonstrate the effectiveness of the proposed\nmethod. 
ETPNav yields more than 10% and 20% improvements over prior\nstate-of-the-art on R2R-CE and RxR-CE datasets, respectively. Our code is\navailable at https://github.com/MarSaKi/ETPNav.\n","authors":["Dong An","Hanqing Wang","Wenguan Wang","Zun Wang","Yan Huang","Keji He","Liang Wang"],"pdf_url":"https://arxiv.org/pdf/2304.03047v3.pdf","comment":"Project page: https://github.com/MarSaKi/ETPNav"},{"id":"http://arxiv.org/abs/2401.11687v1","updated":"2024-01-22T04:54:42Z","published":"2024-01-22T04:54:42Z","title":"TIM: An Efficient Temporal Interaction Module for Spiking Transformer","summary":" Spiking Neural Networks (SNNs), as the third generation of neural networks,\nhave gained prominence for their biological plausibility and computational\nefficiency, especially in processing diverse datasets. The integration of\nattention mechanisms, inspired by advancements in neural network architectures,\nhas led to the development of Spiking Transformers. These have shown promise in\nenhancing SNNs' capabilities, particularly in the realms of both static and\nneuromorphic datasets. Despite their progress, a discernible gap exists in\nthese systems, specifically in the Spiking Self Attention (SSA) mechanism's\neffectiveness in leveraging the temporal processing potential of SNNs. To\naddress this, we introduce the Temporal Interaction Module (TIM), a novel,\nconvolution-based enhancement designed to augment the temporal data processing\nabilities within SNN architectures. TIM's integration into existing SNN\nframeworks is seamless and efficient, requiring minimal additional parameters\nwhile significantly boosting their temporal information handling capabilities.\nThrough rigorous experimentation, TIM has demonstrated its effectiveness in\nexploiting temporal information, leading to state-of-the-art performance across\nvarious neuromorphic datasets.\n","authors":["Sicheng Shen","Dongcheng Zhao","Guobin Shen","Yi Zeng"],"pdf_url":"https://arxiv.org/pdf/2401.11687v1.pdf","comment":"10pages,6figures"},{"id":"http://arxiv.org/abs/2310.09221v2","updated":"2024-01-22T04:48:57Z","published":"2023-10-13T16:18:48Z","title":"Ultrasound Image Segmentation of Thyroid Nodule via Latent Semantic\n Feature Co-Registration","summary":" Segmentation of nodules in thyroid ultrasound imaging plays a crucial role in\nthe detection and treatment of thyroid cancer. However, owing to the diversity\nof scanner vendors and imaging protocols in different hospitals, the automatic\nsegmentation model, which has already demonstrated expert-level accuracy in the\nfield of medical image segmentation, finds its accuracy reduced as the result\nof its weak generalization performance when being applied in clinically\nrealistic environments. To address this issue, the present paper proposes ASTN,\na framework for thyroid nodule segmentation achieved through a new type\nco-registration network. By extracting latent semantic information from the\natlas and target images and utilizing in-depth features to accomplish the\nco-registration of nodules in thyroid ultrasound images, this framework can\nensure the integrity of anatomical structure and reduce the impact on\nsegmentation as the result of overall differences in image caused by different\ndevices. In addition, this paper also provides an atlas selection algorithm to\nmitigate the difficulty of co-registration. 
As shown by the evaluation results\ncollected from the datasets of different devices, thanks to the method we\nproposed, the model generalization has been greatly improved while maintaining\na high level of segmentation accuracy.\n","authors":["Xuewei Li","Yaqiao Zhu","Jie Gao","Xi Wei","Ruixuan Zhang","Yuan Tian","ZhiQiang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.09221v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.07278v2","updated":"2024-01-22T04:43:04Z","published":"2024-01-14T12:22:34Z","title":"Semi-supervised Semantic Segmentation using Redesigned Self-Training for\n White Blood Cell","summary":" Artificial Intelligence (AI) in healthcare, especially in white blood cell\ncancer diagnosis, is hindered by two primary challenges: the lack of\nlarge-scale labeled datasets for white blood cell (WBC) segmentation and\noutdated segmentation methods. To address the first challenge, a\nsemi-supervised learning framework should be brought to efficiently annotate\nthe large dataset. In this work, we address this issue by proposing a novel\nself-training pipeline with the incorporation of FixMatch. We discover that by\nincorporating FixMatch in the self-training pipeline, the performance improves\nin the majority of cases. Our performance achieved the best performance with\nthe self-training scheme with consistency on DeepLab-V3 architecture and\nResNet-50, reaching 90.69%, 87.37%, and 76.49% on Zheng 1, Zheng 2, and LISC\ndatasets, respectively.\n","authors":["Vinh Quoc Luu","Duy Khanh Le","Huy Thanh Nguyen","Minh Thanh Nguyen","Thinh Tien Nguyen","Vinh Quang Dinh"],"pdf_url":"https://arxiv.org/pdf/2401.07278v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11674v1","updated":"2024-01-22T03:24:45Z","published":"2024-01-22T03:24:45Z","title":"Memory-Efficient Prompt Tuning for Incremental Histopathology\n Classification","summary":" Recent studies have made remarkable progress in histopathology\nclassification. Based on current successes, contemporary works proposed to\nfurther upgrade the model towards a more generalizable and robust direction\nthrough incrementally learning from the sequentially delivered domains. Unlike\nprevious parameter isolation based approaches that usually demand massive\ncomputation resources during model updating, we present a memory-efficient\nprompt tuning framework to cultivate model generalization potential in\neconomical memory cost. For each incoming domain, we reuse the existing\nparameters of the initial classification model and attach lightweight trainable\nprompts into it for customized tuning. Considering the domain heterogeneity, we\nperform decoupled prompt tuning, where we adopt a domain-specific prompt for\neach domain to independently investigate its distinctive characteristics, and\none domain-invariant prompt shared across all domains to continually explore\nthe common content embedding throughout time. All domain-specific prompts will\nbe appended to the prompt bank and isolated from further changes to prevent\nforgetting the distinctive features of early-seen domains. While the\ndomain-invariant prompt will be passed on and iteratively evolve by\nstyle-augmented prompt refining to improve model generalization capability over\ntime. In specific, we construct a graph with existing prompts and build a\nstyle-augmented graph attention network to guide the domain-invariant prompt\nexploring the overlapped latent embedding among all delivered domains for more\ndomain generic representations. 
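The FixMatch rule incorporated into the self-training pipeline above reduces to a few lines: a confident prediction on the weakly augmented view becomes the pseudo-label for the strongly augmented view, and unconfident samples are masked out. The threshold value and random logits below are placeholders, not the paper's configuration.

import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(logits_weak, logits_strong, threshold=0.95):
    """Cross-entropy on strong-view predictions against confident weak-view pseudo-labels."""
    probs = F.softmax(logits_weak.detach(), dim=1)        # no gradient through the pseudo-label
    conf, pseudo = probs.max(dim=1)
    mask = (conf >= threshold).float()                    # keep only confident samples
    ce = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * ce).sum() / mask.sum().clamp(min=1.0)

B, K = 16, 3
logits_weak = torch.randn(B, K) * 5                       # stand-in outputs on weak augmentations
logits_strong = torch.randn(B, K, requires_grad=True)     # stand-in outputs on strong augmentations
loss = fixmatch_unlabeled_loss(logits_weak, logits_strong)
loss.backward()
print(float(loss))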
We have extensively evaluated our framework\nwith two histopathology tasks, i.e., breast cancer metastasis classification\nand epithelium-stroma tissue classification, where our approach yielded\nsuperior performance and memory efficiency over the competing methods.\n","authors":["Yu Zhu","Kang Li","Lequan Yu","Pheng-Ann Heng"],"pdf_url":"https://arxiv.org/pdf/2401.11674v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2401.11673v1","updated":"2024-01-22T03:22:49Z","published":"2024-01-22T03:22:49Z","title":"MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View\n Stereo","summary":" Recent advancements in learning-based Multi-View Stereo (MVS) methods have\nprominently featured transformer-based models with attention mechanisms.\nHowever, existing approaches have not thoroughly investigated the profound\ninfluence of transformers on different MVS modules, resulting in limited depth\nestimation capabilities. In this paper, we introduce MVSFormer++, a method that\nprudently maximizes the inherent characteristics of attention to enhance\nvarious components of the MVS pipeline. Formally, our approach involves\ninfusing cross-view information into the pre-trained DINOv2 model to facilitate\nMVS learning. Furthermore, we employ different attention mechanisms for the\nfeature encoder and cost volume regularization, focusing on feature and spatial\naggregations respectively. Additionally, we uncover that some design details\nwould substantially impact the performance of transformer modules in MVS,\nincluding normalized 3D positional encoding, adaptive attention scaling, and\nthe position of layer normalization. Comprehensive experiments on DTU,\nTanks-and-Temples, BlendedMVS, and ETH3D validate the effectiveness of the\nproposed method. Notably, MVSFormer++ achieves state-of-the-art performance on\nthe challenging DTU and Tanks-and-Temples benchmarks.\n","authors":["Chenjie Cao","Xinlin Ren","Yanwei Fu"],"pdf_url":"https://arxiv.org/pdf/2401.11673v1.pdf","comment":"Accepted to ICLR2024"},{"id":"http://arxiv.org/abs/2310.01852v7","updated":"2024-01-22T03:11:15Z","published":"2023-10-03T07:33:27Z","title":"LanguageBind: Extending Video-Language Pretraining to N-modality by\n Language-based Semantic Alignment","summary":" The video-language (VL) pretraining has achieved remarkable improvement in\nmultiple downstream tasks. However, the current VL pretraining framework is\nhard to extend to multiple modalities (N modalities, N>=3) beyond vision and\nlanguage. We thus propose LanguageBind, taking the language as the bind across\ndifferent modalities because the language modality is well-explored and\ncontains rich semantics. Specifically, we freeze the language encoder acquired\nby VL pretraining, then train encoders for other modalities with contrastive\nlearning. As a result, all modalities are mapped to a shared feature space,\nimplementing multi-modal semantic alignment. While LanguageBind ensures that we\ncan extend VL modalities to N modalities, we also need a high-quality dataset\nwith alignment data pairs centered on language. We thus propose VIDAL-10M with\nVideo, Infrared, Depth, Audio and their corresponding Language, naming as\nVIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with\ncomplete semantics rather than truncated segments from long videos, and all the\nvideo, depth, infrared, and audio modalities are aligned to their textual\ndescriptions. 
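The binding recipe described above, a frozen language tower with new modality encoders trained against it contrastively, can be sketched with a symmetric InfoNCE loss. The tiny linear encoders, feature sizes, and random "paired" inputs below are stand-ins for the pretrained text encoder and the depth/infrared/audio towers.

import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

dim, B = 128, 16
language_encoder = nn.Linear(300, dim)                 # stand-in for the pretrained text tower
for p in language_encoder.parameters():
    p.requires_grad_(False)                            # the language space stays frozen

depth_encoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, dim))
optimizer = torch.optim.AdamW(depth_encoder.parameters(), lr=1e-4)

text_feats = language_encoder(torch.randn(B, 300))     # embeddings of paired captions
depth_feats = depth_encoder(torch.randn(B, 64))        # embeddings of the new modality

loss = info_nce(depth_feats, text_feats)
loss.backward()
optimizer.step()                                       # only the new modality tower is updated
print(float(loss))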
LanguageBind has achieved superior performance on a wide range of\n15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple\nexperiments have provided evidence for the effectiveness of LanguageBind in\nachieving indirect alignment and complementarity among diverse modalities. Code\naddress: https://github.com/PKU-YuanGroup/LanguageBind\n","authors":["Bin Zhu","Bin Lin","Munan Ning","Yang Yan","Jiaxi Cui","HongFa Wang","Yatian Pang","Wenhao Jiang","Junwu Zhang","Zongwei Li","Wancai Zhang","Zhifeng Li","Wei Liu","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2310.01852v7.pdf","comment":"Accepted by ICLR 2024"},{"id":"http://arxiv.org/abs/2401.11671v1","updated":"2024-01-22T03:09:00Z","published":"2024-01-22T03:09:00Z","title":"RTA-Former: Reverse Transformer Attention for Polyp Segmentation","summary":" Polyp segmentation is a key aspect of colorectal cancer prevention, enabling\nearly detection and guiding subsequent treatments. Intelligent diagnostic\ntools, including deep learning solutions, are widely explored to streamline and\npotentially automate this process. However, even with many powerful network\narchitectures, there still comes the problem of producing accurate edge\nsegmentation. In this paper, we introduce a novel network, namely RTA-Former,\nthat employs a transformer model as the encoder backbone and innovatively\nadapts Reverse Attention (RA) with a transformer stage in the decoder for\nenhanced edge segmentation. The results of the experiments illustrate that\nRTA-Former achieves state-of-the-art (SOTA) performance in five polyp\nsegmentation datasets. The strong capability of RTA-Former holds promise in\nimproving the accuracy of Transformer-based polyp segmentation, potentially\nleading to better clinical decisions and patient outcomes. Our code will be\npublicly available on GitHub.\n","authors":["Zhikai Li","Murong Yi","Ali Uneri","Sihan Niu","Craig Jones"],"pdf_url":"https://arxiv.org/pdf/2401.11671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.08898v3","updated":"2024-01-22T03:01:28Z","published":"2023-01-21T05:34:29Z","title":"Recurrent Generic Contour-based Instance Segmentation with Progressive\n Learning","summary":" Contour-based instance segmentation has been actively studied, thanks to its\nflexibility and elegance in processing visual objects within complex\nbackgrounds. In this work, we propose a novel deep network architecture, i.e.,\nPolySnake, for generic contour-based instance segmentation. Motivated by the\nclassic Snake algorithm, the proposed PolySnake achieves superior and robust\nsegmentation performance with an iterative and progressive contour refinement\nstrategy. Technically, PolySnake introduces a recurrent update operator to\nestimate the object contour iteratively. It maintains a single estimate of the\ncontour that is progressively deformed toward the object boundary. At each\niteration, PolySnake builds a semantic-rich representation for the current\ncontour and feeds it to the recurrent operator for further contour adjustment.\nThrough the iterative refinements, the contour progressively converges to a\nstable status that tightly encloses the object instance. Beyond the scope of\ngeneral instance segmentation, extensive experiments are conducted to validate\nthe effectiveness and generalizability of our PolySnake in two additional\nspecific task scenarios, including scene text detection and lane detection. 
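The reverse attention operation that RTA-Former adapts in its decoder can be sketched as follows: invert a coarse prediction so that attention shifts to the regions the previous stage called background, which is where missed foreground and fuzzy edges tend to hide, then predict a residual correction. The channel sizes and the two-convolution refinement head are assumptions.

import torch
import torch.nn as nn

class ReverseAttention(nn.Module):
    """Refine a coarse mask by attending to what the previous stage missed."""
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1))

    def forward(self, feats, coarse_logits):
        # 1 - sigmoid(coarse): high where the current prediction says "background".
        reverse = 1.0 - torch.sigmoid(coarse_logits)
        residual = self.refine(feats * reverse)
        return coarse_logits + residual          # residual correction of the coarse mask

feats = torch.randn(2, 32, 44, 44)
coarse = torch.randn(2, 1, 44, 44)
refined = ReverseAttention(32)(feats, coarse)
print(refined.shape)                             # torch.Size([2, 1, 44, 44])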
The\nresults demonstrate that the proposed PolySnake outperforms the existing\nadvanced methods on several multiple prevalent benchmarks across the three\ntasks. The codes and pre-trained models are available at\nhttps://github.com/fh2019ustc/PolySnake\n","authors":["Hao Feng","Keyi Zhou","Wengang Zhou","Yufei Yin","Jiajun Deng","Qi Sun","Houqiang Li"],"pdf_url":"https://arxiv.org/pdf/2301.08898v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07444v3","updated":"2024-01-22T02:56:05Z","published":"2023-04-15T01:33:14Z","title":"The Art of Camouflage: Few-shot Learning for Animal Detection and\n Segmentation","summary":" Camouflaged object detection and segmentation is a new and challenging\nresearch topic in computer vision. There is a serious issue of lacking data of\ncamouflaged objects such as camouflaged animals in natural scenes. In this\npaper, we address the problem of few-shot learning for camouflaged object\ndetection and segmentation. To this end, we first collect a new dataset,\nCAMO-FS, for the benchmark. We then propose a novel method to efficiently\ndetect and segment the camouflaged objects in the images. In particular, we\nintroduce the instance triplet loss and the instance memory storage. The\nextensive experiments demonstrated that our proposed method achieves\nstate-of-the-art performance on the newly collected dataset.\n","authors":["Thanh-Danh Nguyen","Anh-Khoa Nguyen Vu","Nhat-Duy Nguyen","Vinh-Tiep Nguyen","Thanh Duc Ngo","Thanh-Toan Do","Minh-Triet Tran","Tam V. Nguyen"],"pdf_url":"https://arxiv.org/pdf/2304.07444v3.pdf","comment":"Under-review Journal"},{"id":"http://arxiv.org/abs/2305.16789v2","updated":"2024-01-22T02:47:50Z","published":"2023-05-26T09:59:48Z","title":"Modulate Your Spectrum in Self-Supervised Learning","summary":" Whitening loss offers a theoretical guarantee against feature collapse in\nself-supervised learning (SSL) with joint embedding architectures. Typically,\nit involves a hard whitening approach, transforming the embedding and applying\nloss to the whitened output. In this work, we introduce Spectral Transformation\n(ST), a framework to modulate the spectrum of embedding and to seek for\nfunctions beyond whitening that can avoid dimensional collapse. We show that\nwhitening is a special instance of ST by definition, and our empirical\ninvestigations unveil other ST instances capable of preventing collapse.\nAdditionally, we propose a novel ST instance named IterNorm with trace loss\n(INTL). Theoretical analysis confirms INTL's efficacy in preventing collapse\nand modulating the spectrum of embedding toward equal-eigenvalues during\noptimization. Our experiments on ImageNet classification and COCO object\ndetection demonstrate INTL's potential in learning superior representations.\nThe code is available at https://github.com/winci-ai/INTL.\n","authors":["Xi Weng","Yunhao Ni","Tengwei Song","Jie Luo","Rao Muhammad Anwer","Salman Khan","Fahad Shahbaz Khan","Lei Huang"],"pdf_url":"https://arxiv.org/pdf/2305.16789v2.pdf","comment":"Accepted at ICLR 2024. The code is available at\n https://github.com/winci-ai/intl"},{"id":"http://arxiv.org/abs/2401.10150v3","updated":"2024-01-22T02:40:52Z","published":"2024-01-18T17:22:37Z","title":"Motion-Zero: Zero-Shot Moving Object Control Framework for\n Diffusion-Based Video Generation","summary":" Recent large-scale pre-trained diffusion models have demonstrated a powerful\ngenerative ability to produce high-quality videos from detailed text\ndescriptions. 
However, exerting control over the motion of objects in videos\ngenerated by any video diffusion model is a challenging problem. In this paper,\nwe propose a novel zero-shot moving object trajectory control framework,\nMotion-Zero, to enable a bounding-box-trajectories-controlled text-to-video\ndiffusion model. To this end, an initial noise prior module is designed to\nprovide a position-based prior to improve the stability of the appearance of\nthe moving object and the accuracy of position. In addition, based on the\nattention map of the U-net, spatial constraints are directly applied to the\ndenoising process of diffusion models, which further ensures the positional and\nspatial consistency of moving objects during the inference. Furthermore,\ntemporal consistency is guaranteed with a proposed shift temporal attention\nmechanism. Our method can be flexibly applied to various state-of-the-art video\ndiffusion models without any training process. Extensive experiments\ndemonstrate our proposed method can control the motion trajectories of objects\nand generate high-quality videos.\n","authors":["Changgu Chen","Junwei Shu","Lianggangxu Chen","Gaoqi He","Changbo Wang","Yang Li"],"pdf_url":"https://arxiv.org/pdf/2401.10150v3.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2401.11654v1","updated":"2024-01-22T02:21:26Z","published":"2024-01-22T02:21:26Z","title":"ActionHub: A Large-scale Action Video Description Dataset for Zero-shot\n Action Recognition","summary":" Zero-shot action recognition (ZSAR) aims to learn an alignment model between\nvideos and class descriptions of seen actions that is transferable to unseen\nactions. The text queries (class descriptions) used in existing ZSAR works,\nhowever, are often short action names that fail to capture the rich semantics\nin the videos, leading to misalignment. With the intuition that video content\ndescriptions (e.g., video captions) can provide rich contextual information of\nvisual concepts in videos, we propose to utilize human annotated video\ndescriptions to enrich the semantics of the class descriptions of each action.\nHowever, all existing action video description datasets are limited in terms of\nthe number of actions, the semantics of video descriptions, etc. To this end,\nwe collect a large-scale action video descriptions dataset named ActionHub,\nwhich covers a total of 1,211 common actions and provides 3.6 million action\nvideo descriptions. With the proposed ActionHub dataset, we further propose a\nnovel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which\nconsists of a Dual Cross-modality Alignment module and a Cross-action\nInvariance Mining module. Specifically, the Dual Cross-modality Alignment\nmodule utilizes both action labels and video descriptions from ActionHub to\nobtain rich class semantic features for feature alignment. The Cross-action\nInvariance Mining module exploits a cycle-reconstruction process between the\nclass semantic feature spaces of seen actions and unseen actions, aiming to\nguide the model to learn cross-action invariant representations. Extensive\nexperimental results demonstrate that our CoCo framework significantly\noutperforms the state-of-the-art on three popular ZSAR benchmarks (i.e.,\nKinetics-ZSAR, UCF101 and HMDB51) under two different learning protocols in\nZSAR. 
We will release our code, models, and the proposed ActionHub dataset.\n","authors":["Jiaming Zhou","Junwei Liang","Kun-Yu Lin","Jinrui Yang","Wei-Shi Zheng"],"pdf_url":"https://arxiv.org/pdf/2401.11654v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11652v1","updated":"2024-01-22T02:17:36Z","published":"2024-01-22T02:17:36Z","title":"OnDev-LCT: On-Device Lightweight Convolutional Transformers towards\n federated learning","summary":" Federated learning (FL) has emerged as a promising approach to\ncollaboratively train machine learning models across multiple edge devices\nwhile preserving privacy. The success of FL hinges on the efficiency of\nparticipating models and their ability to handle the unique challenges of\ndistributed learning. While several variants of Vision Transformer (ViT) have\nshown great potential as alternatives to modern convolutional neural networks\n(CNNs) for centralized training, the unprecedented size and higher\ncomputational demands hinder their deployment on resource-constrained edge\ndevices, challenging their widespread application in FL. Since client devices\nin FL typically have limited computing resources and communication bandwidth,\nmodels intended for such devices must strike a balance between model size,\ncomputational efficiency, and the ability to adapt to the diverse and non-IID\ndata distributions encountered in FL. To address these challenges, we propose\nOnDev-LCT: Lightweight Convolutional Transformers for On-Device vision tasks\nwith limited training data and resources. Our models incorporate image-specific\ninductive biases through the LCT tokenizer by leveraging efficient depthwise\nseparable convolutions in residual linear bottleneck blocks to extract local\nfeatures, while the multi-head self-attention (MHSA) mechanism in the LCT\nencoder implicitly facilitates capturing global representations of images.\nExtensive experiments on benchmark image datasets indicate that our models\noutperform existing lightweight vision models while having fewer parameters and\nlower computational demands, making them suitable for FL scenarios with data\nheterogeneity and communication bottlenecks.\n","authors":["Chu Myaet Thwal","Minh N. H. Nguyen","Ye Lin Tun","Seong Tae Kim","My T. Thai","Choong Seon Hong"],"pdf_url":"https://arxiv.org/pdf/2401.11652v1.pdf","comment":"Published in Neural Networks"},{"id":"http://arxiv.org/abs/2401.11650v1","updated":"2024-01-22T02:05:33Z","published":"2024-01-22T02:05:33Z","title":"PointGL: A Simple Global-Local Framework for Efficient Point Cloud\n Analysis","summary":" Efficient analysis of point clouds holds paramount significance in real-world\n3D applications. Currently, prevailing point-based models adhere to the\nPointNet++ methodology, which involves embedding and abstracting point features\nwithin a sequence of spatially overlapping local point sets, resulting in\nnoticeable computational redundancy. Drawing inspiration from the streamlined\nparadigm of pixel embedding followed by regional pooling in Convolutional\nNeural Networks (CNNs), we introduce a novel, uncomplicated yet potent\narchitecture known as PointGL, crafted to facilitate efficient point cloud\nanalysis. PointGL employs a hierarchical process of feature acquisition through\ntwo recursive steps. First, the Global Point Embedding leverages\nstraightforward residual Multilayer Perceptrons (MLPs) to effectuate feature\nembedding for each individual point. 
Second, the novel Local Graph Pooling\ntechnique characterizes point-to-point relationships and abstracts regional\nrepresentations through succinct local graphs. The harmonious fusion of\none-time point embedding and parameter-free graph pooling contributes to\nPointGL's defining attributes of minimized model complexity and heightened\nefficiency. Our PointGL attains state-of-the-art accuracy on the ScanObjectNN\ndataset while exhibiting a runtime that is more than 5 times faster and\nutilizing only approximately 4% of the FLOPs and 30% of the parameters compared\nto the recent PointMLP model. The code for PointGL is available at\nhttps://github.com/Roywangj/PointGL.\n","authors":["Jianan Li","Jie Wang","Tingfa Xu"],"pdf_url":"https://arxiv.org/pdf/2401.11650v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11649v1","updated":"2024-01-22T02:03:31Z","published":"2024-01-22T02:03:31Z","title":"M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action\n Recognition","summary":" Recently, the rise of large-scale vision-language pretrained models like\nCLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has\ncaptured substantial attraction in video action recognition. Nevertheless,\nprevailing approaches tend to prioritize strong supervised performance at the\nexpense of compromising the models' generalization capabilities during\ntransfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP\nadapting framework named \\name to address these challenges, preserving both\nhigh supervised performance and robust transferability. Firstly, to enhance the\nindividual modality architectures, we introduce multimodal adapters to both the\nvisual and text branches. Specifically, we design a novel visual TED-Adapter,\nthat performs global Temporal Enhancement and local temporal Difference\nmodeling to improve the temporal representation capabilities of the visual\nencoder. Moreover, we adopt text encoder adapters to strengthen the learning of\nsemantic label information. Secondly, we design a multi-task decoder with a\nrich set of supervisory signals to adeptly satisfy the need for strong\nsupervised performance and generalization within a multimodal framework.\nExperimental results validate the efficacy of our approach, demonstrating\nexceptional performance in supervised learning while maintaining strong\ngeneralization in zero-shot scenarios.\n","authors":["Mengmeng Wang","Jiazheng Xing","Boyuan Jiang","Jun Chen","Jianbiao Mei","Xingxing Zuo","Guang Dai","Jingdong Wang","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2401.11649v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.05482v2","updated":"2024-01-22T01:48:29Z","published":"2023-04-11T20:28:33Z","title":"Computational Pathology: A Survey Review and The Way Forward","summary":" Computational Pathology CPath is an interdisciplinary science that augments\ndevelopments of computational approaches to analyze and model medical\nhistopathology images. The main objective for CPath is to develop\ninfrastructure and workflows of digital diagnostics as an assistive CAD system\nfor clinical pathology, facilitating transformational changes in the diagnosis\nand treatment of cancer that are mainly address by CPath tools. With\nevergrowing developments in deep learning and computer vision algorithms, and\nthe ease of the data flow from digital pathology, currently CPath is witnessing\na paradigm shift. 
Despite the sheer volume of engineering and scientific works\nbeing introduced for cancer image analysis, there is still a considerable gap\nof adopting and integrating these algorithms in clinical practice. This raises\na significant question regarding the direction and trends that are undertaken\nin CPath. In this article we provide a comprehensive review of more than 800\npapers to address the challenges faced in problem design all-the-way to the\napplication and implementation viewpoints. We have catalogued each paper into a\nmodel-card by examining the key works and challenges faced to layout the\ncurrent landscape in CPath. We hope this helps the community to locate relevant\nworks and facilitate understanding of the field's future directions. In a\nnutshell, we oversee the CPath developments in cycle of stages which are\nrequired to be cohesively linked together to address the challenges associated\nwith such multidisciplinary science. We overview this cycle from different\nperspectives of data-centric, model-centric, and application-centric problems.\nWe finally sketch remaining challenges and provide directions for future\ntechnical developments and clinical integration of CPath\n(https://github.com/AtlasAnalyticsLab/CPath_Survey).\n","authors":["Mahdi S. Hosseini","Babak Ehteshami Bejnordi","Vincent Quoc-Huy Trinh","Danial Hasan","Xingwen Li","Taehyo Kim","Haochen Zhang","Theodore Wu","Kajanan Chinniah","Sina Maghsoudlou","Ryan Zhang","Stephen Yang","Jiadai Zhu","Lyndon Chan","Samir Khaki","Andrei Buin","Fatemeh Chaji","Ala Salehi","Bich Ngoc Nguyen","Dimitris Samaras","Konstantinos N. Plataniotis"],"pdf_url":"https://arxiv.org/pdf/2304.05482v2.pdf","comment":"Accepted in Elsevier Journal of Pathology Informatics (JPI) 2024"},{"id":"http://arxiv.org/abs/2401.11644v1","updated":"2024-01-22T01:34:03Z","published":"2024-01-22T01:34:03Z","title":"Friends Across Time: Multi-Scale Action Segmentation Transformer for\n Surgical Phase Recognition","summary":" Automatic surgical phase recognition is a core technology for modern\noperating rooms and online surgical video assessment platforms. Current\nstate-of-the-art methods use both spatial and temporal information to tackle\nthe surgical phase recognition task. Building on this idea, we propose the\nMulti-Scale Action Segmentation Transformer (MS-AST) for offline surgical phase\nrecognition and the Multi-Scale Action Segmentation Causal Transformer\n(MS-ASCT) for online surgical phase recognition. We use ResNet50 or\nEfficientNetV2-M for spatial feature extraction. Our MS-AST and MS-ASCT can\nmodel temporal information at different scales with multi-scale temporal\nself-attention and multi-scale temporal cross-attention, which enhances the\ncapture of temporal relationships between frames and segments. We demonstrate\nthat our method can achieve 95.26% and 96.15% accuracy on the Cholec80 dataset\nfor online and offline surgical phase recognition, respectively, which achieves\nnew state-of-the-art results. 
Our method can also achieve state-of-the-art\nresults on non-medical datasets in the video action segmentation domain.\n","authors":["Bokai Zhang","Jiayuan Meng","Bin Cheng","Dean Biskup","Svetlana Petculescu","Angela Chapman"],"pdf_url":"https://arxiv.org/pdf/2401.11644v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.17778v3","updated":"2024-01-22T00:54:30Z","published":"2023-06-30T16:31:14Z","title":"Look, Remember and Reason: Grounded reasoning in videos with language\n models","summary":" Multi-modal language models (LM) have recently shown promising performance in\nhigh-level reasoning tasks on videos. However, existing methods still fall\nshort in tasks like causal or compositional spatiotemporal reasoning over\nactions, in which model predictions need to be grounded in fine-grained\nlow-level details, such as object motions and object interactions. In this\nwork, we propose training an LM end-to-end on low-level surrogate tasks,\nincluding object detection, re-identification, and tracking, to endow the model\nwith the required low-level visual capabilities. We show that a two-stream\nvideo encoder with spatiotemporal attention is effective at capturing the\nrequired static and motion-based cues in the video. By leveraging the LM's\nability to perform the low-level surrogate tasks, we can cast reasoning in\nvideos as the three-step process of Look, Remember, Reason wherein visual\ninformation is extracted using low-level visual skills step-by-step and then\nintegrated to arrive at a final answer. We demonstrate the effectiveness of our\nframework on diverse visual reasoning tasks from the ACRE, CATER,\nSomething-Else and STAR datasets. Our approach is trainable end-to-end and\nsurpasses state-of-the-art task-specific methods across these tasks by a large\nmargin.\n","authors":["Apratim Bhattacharyya","Sunny Panchal","Mingu Lee","Reza Pourreza","Pulkit Madan","Roland Memisevic"],"pdf_url":"https://arxiv.org/pdf/2306.17778v3.pdf","comment":"To appear at ICLR 2024"},{"id":"http://arxiv.org/abs/2309.01409v5","updated":"2024-01-22T00:22:14Z","published":"2023-09-04T07:40:30Z","title":"Implicit Neural Image Stitching","summary":" Existing frameworks for image stitching often provide visually reasonable\nstitchings. However, they suffer from blurry artifacts and disparities in\nillumination, depth level, etc. Although the recent learning-based stitchings\nrelax such disparities, the required methods impose sacrifice of image\nqualities failing to capture high-frequency details for stitched images. To\naddress the problem, we propose a novel approach, implicit Neural Image\nStitching (NIS) that extends arbitrary-scale super-resolution. Our method\nestimates Fourier coefficients of images for quality-enhancing warps. Then, the\nsuggested model blends color mismatches and misalignment in the latent space\nand decodes the features into RGB values of stitched images. Our experiments\nshow that our approach achieves improvement in resolving the low-definition\nimaging of the previous deep image stitching with favorable accelerated\nimage-enhancing methods. 
Our source code is available at\nhttps://github.com/minshu-kim/NIS.\n","authors":["Minsu Kim","Jaewon Lee","Byeonghun Lee","Sunghoon Im","Kyong Hwan Jin"],"pdf_url":"https://arxiv.org/pdf/2309.01409v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11633v1","updated":"2024-01-22T00:00:30Z","published":"2024-01-22T00:00:30Z","title":"Zoom-shot: Fast and Efficient Unsupervised Zero-Shot Transfer of CLIP to\n Vision Encoders with Multimodal Loss","summary":" The fusion of vision and language has brought about a transformative shift in\ncomputer vision through the emergence of Vision-Language Models (VLMs).\nHowever, the resource-intensive nature of existing VLMs poses a significant\nchallenge. We need an accessible method for developing the next generation of\nVLMs. To address this issue, we propose Zoom-shot, a novel method for\ntransferring the zero-shot capabilities of CLIP to any pre-trained vision\nencoder. We do this by exploiting the multimodal information (i.e. text and\nimage) present in the CLIP latent space through the use of specifically\ndesigned multimodal loss functions. These loss functions are (1)\ncycle-consistency loss and (2) our novel prompt-guided knowledge distillation\nloss (PG-KD). PG-KD combines the concept of knowledge distillation with CLIP's\nzero-shot classification, to capture the interactions between text and image\nfeatures. With our multimodal losses, we train a $\\textbf{linear mapping}$\nbetween the CLIP latent space and the latent space of a pre-trained vision\nencoder, for only a $\\textbf{single epoch}$. Furthermore, Zoom-shot is entirely\nunsupervised and is trained using $\\textbf{unpaired}$ data. We test the\nzero-shot capabilities of a range of vision encoders augmented as new VLMs, on\ncoarse and fine-grained classification datasets, outperforming the previous\nstate-of-the-art in this problem domain. In our ablations, we find Zoom-shot\nallows for a trade-off between data and compute during training; and our\nstate-of-the-art results can be obtained by reducing training from 20% to 1% of\nthe ImageNet training data with 20 epochs. All code and models are available on\nGitHub.\n","authors":["Jordan Shipard","Arnold Wiliem","Kien Nguyen Thanh","Wei Xiang","Clinton Fookes"],"pdf_url":"https://arxiv.org/pdf/2401.11633v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2311.02749v2","updated":"2024-01-22T21:30:26Z","published":"2023-11-05T19:59:36Z","title":"Fast Point Cloud to Mesh Reconstruction for Deformable Object Tracking","summary":" The world around us is full of soft objects we perceive and deform with\ndexterous hand movements. For a robotic hand to control soft objects, it has to\nacquire online state feedback of the deforming object. While RGB-D cameras can\ncollect occluded point clouds at a rate of 30Hz, this does not represent a\ncontinuously trackable object surface. Hence, in this work, we developed a\nmethod that takes as input a template mesh which is the mesh of an object in\nits non-deformed state and a deformed point cloud of the same object, and then\nshapes the template mesh such that it matches the deformed point cloud. The\nreconstruction of meshes from point clouds has long been studied in the field\nof Computer graphics under 3D reconstruction and 4D reconstruction, however,\nboth lack the speed and generalizability needed for robotics applications. 
Our\nmodel is designed using a point cloud auto-encoder and a Real-NVP architecture.\nOur trained model can perform mesh reconstruction and tracking at a rate of\n58Hz on a template mesh of 3000 vertices and a deformed point cloud of 5000\npoints and is generalizable to the deformations of six different object\ncategories which are assumed to be made of soft material in our experiments\n(scissors, hammer, foam brick, cleanser bottle, orange, and dice). The object\nmeshes are taken from the YCB benchmark dataset. An instance of a downstream\napplication can be the control algorithm for a robotic hand that requires\nonline feedback from the state of the manipulated object which would allow\nonline grasp adaptation in a closed-loop manner. Furthermore, the tracking\ncapacity of our method can help in the system identification of deforming\nobjects in a marker-free approach. In future work, we will extend our trained\nmodel to generalize beyond six object categories and additionally to real-world\ndeforming point clouds.\n","authors":["Elham Amin Mansour","Hehui Zheng","Robert K. Katzschmann"],"pdf_url":"https://arxiv.org/pdf/2311.02749v2.pdf","comment":"8 pages with appendix,16 figures"},{"id":"http://arxiv.org/abs/2305.03053v2","updated":"2024-01-22T20:56:16Z","published":"2023-05-04T17:59:58Z","title":"ZipIt! Merging Models from Different Tasks without Training","summary":" Typical deep visual recognition models are capable of performing the one task\nthey were trained on. In this paper, we tackle the extremely difficult problem\nof combining distinct models with different initializations, each solving a\nseparate task, into one multi-task model without any additional training. Prior\nwork in model merging permutes one model to the space of the other then\naverages them together. While this works for models trained on the same task,\nwe find that this fails to account for the differences in models trained on\ndisjoint tasks. Thus, we introduce \"ZipIt!\", a general method for merging two\narbitrary models of the same architecture that incorporates two simple\nstrategies. First, in order to account for features that aren't shared between\nmodels, we expand the model merging problem to allow for merging features\nwithin each model by defining a general \"zip\" operation. Second, we add support\nfor partially zipping the models up until a specified layer, naturally creating\na multi-head model. We find that these two changes combined account for 20-60%\nimprovement over prior work, making it more feasible to merge models trained on\ndisjoint tasks without retraining.\n","authors":["George Stoica","Daniel Bolya","Jakob Bjorner","Pratik Ramesh","Taylor Hearn","Judy Hoffman"],"pdf_url":"https://arxiv.org/pdf/2305.03053v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12350v1","updated":"2024-01-22T20:32:31Z","published":"2024-01-22T20:32:31Z","title":"Scaling Up Quantization-Aware Neural Architecture Search for Efficient\n Deep Learning on the Edge","summary":" Neural Architecture Search (NAS) has become the de-facto approach for\ndesigning accurate and efficient networks for edge devices. Since models are\ntypically quantized for edge deployment, recent work has investigated\nquantization-aware NAS (QA-NAS) to search for highly accurate and efficient\nquantized models. However, existing QA-NAS approaches, particularly few-bit\nmixed-precision (FB-MP) methods, do not scale to larger tasks. Consequently,\nQA-NAS has mostly been limited to low-scale tasks and tiny networks. 
In this\nwork, we present an approach to enable QA-NAS (INT8 and FB-MP) on large-scale\ntasks by leveraging the block-wise formulation introduced by block-wise NAS. We\ndemonstrate strong results for the semantic segmentation task on the Cityscapes\ndataset, finding FB-MP models 33% smaller and INT8 models 17.6% faster than\nDeepLabV3 (INT8) without compromising task performance.\n","authors":["Yao Lu","Hiram Rayo Torres Rodriguez","Sebastian Vogel","Nick van de Waterlaat","Pavol Jancura"],"pdf_url":"https://arxiv.org/pdf/2401.12350v1.pdf","comment":"Accepted at Workshop on Compilers, Deployment, and Tooling for Edge\n AI (CODAI '23 ), September 21, 2023, Hamburg, Germany"},{"id":"http://arxiv.org/abs/2401.12344v1","updated":"2024-01-22T20:17:14Z","published":"2024-01-22T20:17:14Z","title":"OCT-SelfNet: A Self-Supervised Framework with Multi-Modal Datasets for\n Generalized and Robust Retinal Disease Detection","summary":" Despite the revolutionary impact of AI and the development of locally trained\nalgorithms, achieving widespread generalized learning from multi-modal data in\nmedical AI remains a significant challenge. This gap hinders the practical\ndeployment of scalable medical AI solutions. Addressing this challenge, our\nresearch contributes a self-supervised robust machine learning framework,\nOCT-SelfNet, for detecting eye diseases using optical coherence tomography\n(OCT) images. In this work, various data sets from various institutions are\ncombined enabling a more comprehensive range of representation. Our method\naddresses the issue using a two-phase training approach that combines\nself-supervised pretraining and supervised fine-tuning with a mask autoencoder\nbased on the SwinV2 backbone by providing a solution for real-world clinical\ndeployment. Extensive experiments on three datasets with different encoder\nbackbones, low data settings, unseen data settings, and the effect of\naugmentation show that our method outperforms the baseline model, Resnet-50 by\nconsistently attaining AUC-ROC performance surpassing 77% across all tests,\nwhereas the baseline model exceeds 54%. Moreover, in terms of the AUC-PR\nmetric, our proposed method exceeded 42%, showcasing a substantial increase of\nat least 10% in performance compared to the baseline, which exceeded only 33%.\nThis contributes to our understanding of our approach's potential and\nemphasizes its usefulness in clinical settings.\n","authors":["Fatema-E Jannat","Sina Gholami","Minhaj Nur Alam","Hamed Tabkhi"],"pdf_url":"https://arxiv.org/pdf/2401.12344v1.pdf","comment":"12 pages, 7 figures, 6 tables"},{"id":"http://arxiv.org/abs/2401.12340v1","updated":"2024-01-22T20:08:57Z","published":"2024-01-22T20:08:57Z","title":"Contrastive Learning and Cycle Consistency-based Transductive Transfer\n Learning for Target Annotation","summary":" Annotating automatic target recognition (ATR) is a highly challenging task,\nprimarily due to the unavailability of labeled data in the target domain.\nHence, it is essential to construct an optimal target domain classifier by\nutilizing the labeled information of the source domain images. The transductive\ntransfer learning (TTL) method that incorporates a CycleGAN-based unpaired\ndomain translation network has been previously proposed in the literature for\neffective ATR annotation. 
Although this method demonstrates great potential for\nATR, it severely suffers from lower annotation performance, higher Fr\\'echet\nInception Distance (FID) score, and the presence of visual artifacts in the\nsynthetic images. To address these issues, we propose a hybrid contrastive\nlearning base unpaired domain translation (H-CUT) network that achieves a\nsignificantly lower FID score. It incorporates both attention and entropy to\nemphasize the domain-specific region, a noisy feature mixup module to generate\nhigh variational synthetic negative patches, and a modulated noise contrastive\nestimation (MoNCE) loss to reweight all negative patches using optimal\ntransport for better performance. Our proposed contrastive learning and\ncycle-consistency-based TTL (C3TTL) framework consists of two H-CUT networks\nand two classifiers. It simultaneously optimizes cycle-consistency, MoNCE, and\nidentity losses. In C3TTL, two H-CUT networks have been employed through a\nbijection mapping to feed the reconstructed source domain images into a\npretrained classifier to guide the optimal target domain classifier. Extensive\nexperimental analysis conducted on three ATR datasets demonstrates that the\nproposed C3TTL method is effective in annotating civilian and military\nvehicles, as well as ship targets.\n","authors":["Shoaib Meraj Sami","Md Mahedi Hasan","Nasser M. Nasrabadi","Raghuveer Rao"],"pdf_url":"https://arxiv.org/pdf/2401.12340v1.pdf","comment":"This Paper is Accepted in IEEE TRANSACTIONS ON AEROSPACE AND\n ELECTRONIC SYSTEMS. This Arxiv version is an older version than the reviewed\n version"},{"id":"http://arxiv.org/abs/2002.04251v3","updated":"2024-01-22T20:05:23Z","published":"2020-02-11T08:24:19Z","title":"2.75D: Boosting learning by representing 3D Medical imaging to 2D\n features for small data","summary":" In medical-data driven learning, 3D convolutional neural networks (CNNs) have\nstarted to show superior performance to 2D CNNs in numerous deep learning\ntasks, proving the added value of 3D spatial information in feature\nrepresentation. However, the difficulty in collecting more training samples to\nconverge, more computational resources and longer execution time make this\napproach less applied. Also, applying transfer learning on 3D CNN is\nchallenging due to a lack of publicly available pre-trained 3D models. To\ntackle these issues, we proposed a novel 2D strategical representation of\nvolumetric data, namely 2.75D. In this work, the spatial information of 3D\nimages is captured in a single 2D view by a spiral-spinning technique. As a\nresult, 2D CNN networks can also be used to learn volumetric information.\nBesides, we can fully leverage pre-trained 2D CNNs for downstream vision\nproblems. We also explore a multi-view 2.75D strategy, 2.75D 3 channels\n(2.75Dx3), to boost the advantage of 2.75D. We evaluated the proposed methods\non three public datasets with different modalities or organs (Lung CT, Breast\nMRI, and Prostate MRI), against their 2D, 2.5D, and 3D counterparts in\nclassification tasks. Results show that the proposed methods significantly\noutperform other counterparts when all methods were trained from scratch on the\nlung dataset. Such performance gain is more pronounced with transfer learning\nor in the case of limited training data. Our methods also achieved comparable\nperformance on other datasets. 
In addition, our methods achieved a substantial\nreduction in time consumption of training and inference compared with the 2.5D\nor 3D method.\n","authors":["Xin Wang","Ruisheng Su","Weiyi Xie","Wenjin Wang","Yi Xu","Ritse Mann","Jungong Han","Tao Tan"],"pdf_url":"https://arxiv.org/pdf/2002.04251v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.17189v3","updated":"2024-01-22T19:29:16Z","published":"2023-09-29T12:38:00Z","title":"RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual\n speech separation","summary":" Audio-visual speech separation methods aim to integrate different modalities\nto generate high-quality separated speech, thereby enhancing the performance of\ndownstream tasks such as speech recognition. Most existing state-of-the-art\n(SOTA) models operate in the time domain. However, their overly simplistic\napproach to modeling acoustic features often necessitates larger and more\ncomputationally intensive models in order to achieve SOTA performance. In this\npaper, we present a novel time-frequency domain audio-visual speech separation\nmethod: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies\nits algorithms on the complex time-frequency bins yielded by the Short-Time\nFourier Transform. We model and capture the time and frequency dimensions of\nthe audio independently using a multi-layered RNN along each dimension.\nFurthermore, we introduce a unique attention-based fusion technique for the\nefficient integration of audio and visual information, and a new mask\nseparation approach that takes advantage of the intrinsic spectral nature of\nthe acoustic features for a clearer separation. RTFS-Net outperforms the\nprevious SOTA method using only 10% of the parameters and 18% of the MACs. This\nis the first time-frequency domain audio-visual speech separation method to\noutperform all contemporary time-domain counterparts.\n","authors":["Samuel Pegg","Kai Li","Xiaolin Hu"],"pdf_url":"https://arxiv.org/pdf/2309.17189v3.pdf","comment":"Accepted by ICLR 2024"},{"id":"http://arxiv.org/abs/2310.05207v2","updated":"2024-01-22T19:06:15Z","published":"2023-10-08T15:49:26Z","title":"Boosting Facial Action Unit Detection Through Jointly Learning Facial\n Landmark Detection and Domain Separation and Reconstruction","summary":" Recently how to introduce large amounts of unlabeled facial images in the\nwild into supervised Facial Action Unit (AU) detection frameworks has become a\nchallenging problem. In this paper, we propose a new AU detection framework\nwhere multi-task learning is introduced to jointly learn AU domain separation\nand reconstruction and facial landmark detection by sharing the parameters of\nhomostructural facial extraction modules. In addition, we propose a new feature\nalignment scheme based on contrastive learning by simple projectors and an\nimproved contrastive loss, which adds four additional intermediate supervisors\nto promote the feature reconstruction process. 
Experimental results on two\nbenchmarks demonstrate our superiority against the state-of-the-art methods for\nAU detection in the wild.\n","authors":["Ziqiao Shang","Li Yu"],"pdf_url":"https://arxiv.org/pdf/2310.05207v2.pdf","comment":"5 pages, 1 figure, published to ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.12275v1","updated":"2024-01-22T18:58:22Z","published":"2024-01-22T18:58:22Z","title":"Multi-Agent Dynamic Relational Reasoning for Social Robot Navigation","summary":" Social robot navigation can be helpful in various contexts of daily life but\nrequires safe human-robot interactions and efficient trajectory planning. While\nmodeling pairwise relations has been widely studied in multi-agent interacting\nsystems, the ability to capture larger-scale group-wise activities is limited.\nIn this paper, we propose a systematic relational reasoning approach with\nexplicit inference of the underlying dynamically evolving relational\nstructures, and we demonstrate its effectiveness for multi-agent trajectory\nprediction and social robot navigation. In addition to the edges between pairs\nof nodes (i.e., agents), we propose to infer hyperedges that adaptively connect\nmultiple nodes to enable group-wise reasoning in an unsupervised manner. Our\napproach infers dynamically evolving relation graphs and hypergraphs to capture\nthe evolution of relations, which the trajectory predictor employs to generate\nfuture states. Meanwhile, we propose to regularize the sharpness and sparsity\nof the learned relations and the smoothness of the relation evolution, which\nproves to enhance training stability and model performance. The proposed\napproach is validated on synthetic crowd simulations and real-world benchmark\ndatasets. Experiments demonstrate that the approach infers reasonable relations\nand achieves state-of-the-art prediction performance. In addition, we present a\ndeep reinforcement learning (DRL) framework for social robot navigation, which\nincorporates relational reasoning and trajectory prediction systematically. In\na group-based crowd simulation, our method outperforms the strongest baseline\nby a significant margin in terms of safety, efficiency, and social compliance\nin dense, interactive scenarios.\n","authors":["Jiachen Li","Chuanbo Hua","Hengbo Ma","Jinkyoo Park","Victoria Dax","Mykel J. Kochenderfer"],"pdf_url":"https://arxiv.org/pdf/2401.12275v1.pdf","comment":"19 pages, 8 figures, 6 tables"},{"id":"http://arxiv.org/abs/2312.07063v2","updated":"2024-01-22T15:30:26Z","published":"2023-12-12T08:32:55Z","title":"Template Free Reconstruction of Human-object Interaction with Procedural\n Interaction Generation","summary":" Reconstructing human-object interaction in 3D from a single RGB image is a\nchallenging task and existing data driven methods do not generalize beyond the\nobjects present in the carefully curated 3D interaction datasets. Capturing\nlarge-scale real data to learn strong interaction and 3D shape priors is very\nexpensive due to the combinatorial nature of human-object interactions. In this\npaper, we propose ProciGen (Procedural interaction Generation), a method to\nprocedurally generate datasets with both, plausible interaction and diverse\nobject variation. We generate 1M+ human-object interaction pairs in 3D and\nleverage this large-scale data to train our HDM (Hierarchical Diffusion Model),\na novel method to reconstruct interacting human and unseen objects, without any\ntemplates. 
Our HDM is an image-conditioned diffusion model that learns both\nrealistic interaction and highly accurate human and object shapes. Experiments\nshow that our HDM trained with ProciGen significantly outperforms prior methods\nthat requires template meshes and that our dataset allows training methods with\nstrong generalization ability to unseen object instances. Our code and data\nwill be publicly released at:\nhttps://virtualhumans.mpi-inf.mpg.de/procigen-hdm.\n","authors":["Xianghui Xie","Bharat Lal Bhatnagar","Jan Eric Lenssen","Gerard Pons-Moll"],"pdf_url":"https://arxiv.org/pdf/2312.07063v2.pdf","comment":"23 pages, 18 figures. Project page:\n https://virtualhumans.mpi-inf.mpg.de/procigen-hdm (updated the\n acknowledgement)"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2204.11209v3","updated":"2024-01-22T14:13:11Z","published":"2022-04-24T07:18:04Z","title":"Hierarchical Locality Sensitive Hashing for Structured Data: A Survey","summary":" Data similarity (or distance) computation is a fundamental research topic\nwhich fosters a variety of similarity-based machine learning and data mining\napplications. In big data analytics, it is impractical to compute the exact\nsimilarity of data instances due to high computational cost. To this end, the\nLocality Sensitive Hashing (LSH) technique has been proposed to provide\naccurate estimators for various similarity measures between sets or vectors in\nan efficient manner without the learning process. Structured data (e.g.,\nsequences, trees and graphs), which are composed of elements and relations\nbetween the elements, are commonly seen in the real world, but the traditional\nLSH algorithms cannot preserve the structure information represented as\nrelations between elements. In order to conquer the issue, researchers have\nbeen devoted to the family of the hierarchical LSH algorithms. In this paper,\nwe explore the present progress of the research into hierarchical LSH from the\nfollowing perspectives: 1) Data structures, where we review various\nhierarchical LSH algorithms for three typical data structures and uncover their\ninherent connections; 2) Applications, where we review the hierarchical LSH\nalgorithms in multiple application scenarios; 3) Challenges, where we discuss\nsome potential challenges as future directions.\n","authors":["Wei Wu","Bin Li"],"pdf_url":"https://arxiv.org/pdf/2204.11209v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.16034v2","updated":"2024-01-22T11:26:35Z","published":"2023-09-27T21:26:01Z","title":"Analytical Modelling of Raw Data for Flow-Guided In-body Nanoscale\n Localization","summary":" Advancements in nanotechnology and material science are paving the way toward\nnanoscale devices that combine sensing, computing, data and energy storage, and\nwireless communication. In precision medicine, these nanodevices show promise\nfor disease diagnostics, treatment, and monitoring from within the patients'\nbloodstreams. Assigning the location of a sensed biological event with the\nevent itself, which is the main proposition of flow-guided in-body nanoscale\nlocalization, would be immensely beneficial from the perspective of precision\nmedicine. The nanoscale nature of the nanodevices and the challenging\nenvironment that the bloodstream represents, result in current flow-guided\nlocalization approaches being constrained in their communication and\nenergy-related capabilities. 
The communication and energy constraints of the\nnanodevices result in different features of raw data for flow-guided\nlocalization, in turn affecting its performance. An analytical modeling of the\neffects of imperfect communication and constrained energy causing intermittent\noperation of the nanodevices on the raw data produced by the nanodevices would\nbe beneficial. Hence, we propose an analytical model of raw data for\nflow-guided localization, where the raw data is modeled as a function of\ncommunication and energy-related capabilities of the nanodevice. We evaluate\nthe model by comparing its output with the one obtained through the utilization\nof a simulator for objective evaluation of flow-guided localization, featuring\ncomparably higher level of realism. Our results across a number of scenarios\nand heterogeneous performance metrics indicate high similarity between the\nmodel and simulator-generated raw datasets.\n","authors":["Guillem Pascual","Filip Lemic","Carmen Delgado","Xavier Costa-Perez"],"pdf_url":"https://arxiv.org/pdf/2309.16034v2.pdf","comment":"6 pages, 7 figures, 4 tables, 16 references"},{"id":"http://arxiv.org/abs/2401.11800v1","updated":"2024-01-22T10:01:06Z","published":"2024-01-22T10:01:06Z","title":"Revisiting Document-Level Relation Extraction with Context-Guided Link\n Prediction","summary":" Document-level relation extraction (DocRE) poses the challenge of identifying\nrelationships between entities within a document as opposed to the traditional\nRE setting where a single sentence is input. Existing approaches rely on\nlogical reasoning or contextual cues from entities. This paper reframes\ndocument-level RE as link prediction over a knowledge graph with distinct\nbenefits: 1) Our approach combines entity context with document-derived logical\nreasoning, enhancing link prediction quality. 2) Predicted links between\nentities offer interpretability, elucidating employed reasoning. We evaluate\nour approach on three benchmark datasets: DocRED, ReDocRED, and DWIE. The\nresults indicate that our proposed method outperforms the state-of-the-art\nmodels and suggests that incorporating context-based link prediction techniques\ncan enhance the performance of document-level relation extraction models.\n","authors":["Monika Jain","Raghava Mutharaju","Ramakanth Kavuluru","Kuldeep Singh"],"pdf_url":"https://arxiv.org/pdf/2401.11800v1.pdf","comment":"Accepted in AAAI 2024"},{"id":"http://arxiv.org/abs/2305.19604v3","updated":"2024-01-22T08:13:50Z","published":"2023-05-31T07:22:15Z","title":"Medication Recommendation via Domain Knowledge Informed Deep Learning","summary":" Medication recommendation is a fundamental yet crucial branch of healthcare,\nwhich provides opportunities to support clinical physicians with more accurate\nmedication prescriptions for patients with complex health conditions. Learning\nfrom electronic health records (EHR) to recommend medications is the most\ncommon way in previous studies. However, most of them neglect incorporating\ndomain knowledge according to the clinical manifestations in the EHR of the\npatient. To address these issues, we propose a novel \\textbf{D}omain\n\\textbf{K}nowledge \\textbf{I}nformed \\textbf{Net}work (DKINet) to integrate\ndomain knowledge with observable clinical manifestations of the patient, which\nis the first dynamic domain knowledge informed framework toward medication\nrecommendation. 
In particular, we first design a knowledge-driven encoder to\ncapture the domain information and then develop a data-driven encoder to\nintegrate domain knowledge into the observable EHR. To endow the model with the\ncapability of temporal decision, we design an explicit medication encoder for\nlearning the longitudinal dependence of the patient. Extensive experiments on\nthree publicly available datasets verify the superiority of our method. The\ncode will be public upon acceptance.\n","authors":["Sicen Liu","Xiaolong Wang","Xianbing Zhao","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2305.19604v3.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2401.11742v1","updated":"2024-01-22T08:00:49Z","published":"2024-01-22T08:00:49Z","title":"Knowledge Navigation: Inferring the Interlocking Map of Knowledge from\n Research Trajectories","summary":" \"If I have seen further, it is by standing on the shoulders of giants,\" Isaac\nNewton's renowned statement hints that new knowledge builds upon existing\nfoundations, which means there exists an interdependent relationship between\nknowledge, which, yet uncovered, is implied in the historical development of\nscientific systems for hundreds of years. By leveraging natural language\nprocessing techniques, this study introduces an innovative embedding scheme\ndesigned to infer the \"knowledge interlocking map.\" This map, derived from the\nresearch trajectories of millions of scholars, reveals the intricate\nconnections among knowledge. We validate that the inferred map effectively\ndelineates disciplinary boundaries and captures the intricate relationships\nbetween diverse concepts. The utility of the interlocking map is showcased\nthrough multiple applications. Firstly, we demonstrated the multi-step analogy\ninferences within the knowledge space and the functional connectivity between\nconcepts in different disciplines. Secondly, we trace the evolution of\nknowledge across domains, observing trends such as shifts from \"Theoretical\" to\n\"Applied\" or \"Chemistry\" to \"Biomedical\" along predefined functional\ndirections. Lastly, by analyzing the high-dimensional knowledge network\nstructure, we found that knowledge connects each other with shorter global\npathways, and the interdisciplinary knowledge plays a critical role in\naccessibility of the global knowledge network. Our framework offers a novel\napproach to mining knowledge inheritance pathways in extensive scientific\nliterature, which is of great significance for understanding scientific\ndevelopment patterns, tailoring scientific learning trajectories, and\naccelerating scientific progress.\n","authors":["Shibing Xiang","Bing Liu","Yurui Huang","Chaolin Tian","Xin Jiang","Yifang Ma"],"pdf_url":"https://arxiv.org/pdf/2401.11742v1.pdf","comment":"28 pages, 9 figures, 5 tables"},{"id":"http://arxiv.org/abs/2304.01225v2","updated":"2024-01-22T06:31:50Z","published":"2023-04-02T07:25:01Z","title":"A greedy approach for increased vehicle utilization in ridesharing\n networks","summary":" In recent years, ridesharing platforms have become a prominent mode of\ntransportation for the residents of urban areas. As a fundamental problem,\nroute recommendation for these platforms is vital for their sustenance. The\nworks done in this direction have recommended routes with higher passenger\ndemand. Despite the existing works, statistics have suggested that these\nservices cause increased greenhouse emissions compared to private vehicles as\nthey roam around in search of riders. 
This analysis provides finer details\nregarding the functionality of ridesharing systems and it reveals that in the\nface of their boom, they have not utilized the vehicle capacity efficiently. We\npropose to overcome the above limitations and recommend routes that will fetch\nmultiple passengers simultaneously which will result in increased vehicle\nutilization and thereby decrease the effect of these systems on the\nenvironment. As route recommendation is NP-hard, we propose a k-hop-based\nsliding window approximation algorithm that reduces the search space from\nentire road network to a window. We further demonstrate that maximizing\nexpected demand is submodular and greedy algorithms can be used to optimize our\nobjective function within a window. We evaluate our proposed model on\nreal-world datasets and experimental results demonstrate superior performance\nby our proposed model.\n","authors":["Aqsa Ashraf Makhdomi","Iqra Altaf Gillani"],"pdf_url":"https://arxiv.org/pdf/2304.01225v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11705v1","updated":"2024-01-22T06:12:48Z","published":"2024-01-22T06:12:48Z","title":"Domain-Aware Cross-Attention for Cross-domain Recommendation","summary":" Cross-domain recommendation (CDR) is an important method to improve\nrecommender system performance, especially when observations in target domains\nare sparse. However, most existing cross-domain recommendations fail to fully\nutilize the target domain's special features and are hard to be generalized to\nnew domains. The designed network is complex and is not suitable for rapid\nindustrial deployment. Our method introduces a two-step domain-aware\ncross-attention, extracting transferable features of the source domain from\ndifferent granularity, which allows the efficient expression of both domain and\nuser interests. In addition, we simplify the training process, and our model\ncan be easily deployed on new domains. We conduct experiments on both public\ndatasets and industrial datasets, and the experimental results demonstrate the\neffectiveness of our method. We have also deployed the model in an online\nadvertising system and observed significant improvements in both\nClick-Through-Rate (CTR) and effective cost per mille (ECPM).\n","authors":["Yuhao Luo","Shiwei Ma","Mingjun Nie","Changping Peng","Zhangang Lin","Jingping Shao","Qianfang Xu"],"pdf_url":"https://arxiv.org/pdf/2401.11705v1.pdf","comment":"6 pages, 1 figure"},{"id":"http://arxiv.org/abs/2401.11648v1","updated":"2024-01-22T01:58:32Z","published":"2024-01-22T01:58:32Z","title":"Next Visit Diagnosis Prediction via Medical Code-Centric Multimodal\n Contrastive EHR Modelling with Hierarchical Regularisation","summary":" Predicting next visit diagnosis using Electronic Health Records (EHR) is an\nessential task in healthcare, critical for devising proactive future plans for\nboth healthcare providers and patients. Nonetheless, many preceding studies\nhave not sufficiently addressed the heterogeneous and hierarchical\ncharacteristics inherent in EHR data, inevitably leading to sub-optimal\nperformance. To this end, we propose NECHO, a novel medical code-centric\nmultimodal contrastive EHR learning framework with hierarchical regularisation.\nFirst, we integrate multifaceted information encompassing medical codes,\ndemographics, and clinical notes using a tailored network design and a pair of\nbimodal contrastive losses, all of which pivot around a medical code\nrepresentation. 
We also regularise modality-specific encoders using a parental\nlevel information in medical ontology to learn hierarchical structure of EHR\ndata. A series of experiments on MIMIC-III data demonstrates effectiveness of\nour approach.\n","authors":["Heejoon Koo"],"pdf_url":"https://arxiv.org/pdf/2401.11648v1.pdf","comment":"Accepted to EACL 2024 (The 18th Conference of the European Chapter of\n the Association for Computational Linguistics)"},{"id":"http://arxiv.org/abs/2306.16001v2","updated":"2024-01-22T00:27:45Z","published":"2023-06-28T08:20:35Z","title":"Streamlining Social Media Information Extraction for Public Health\n Research with Deep Learning","summary":" Objective: Social media-based public health research is crucial for epidemic\nsurveillance, but most studies identify relevant corpora with keyword matching.\nThis study develops a system to streamline the process of curating colloquial\nmedical dictionaries. We demonstrate the pipeline by curating a UMLS-colloquial\nsymptom dictionary from COVID-19-related tweets as proof of concept. Methods:\nCOVID-19-related tweets from February 1, 2020, to April 30, 2022 were used. The\npipeline includes three modules: a named entity recognition module to detect\nsymptoms in tweets; an entity normalization module to aggregate detected\nentities; and a mapping module that iteratively maps entities to Unified\nMedical Language System concepts. A random 500 entity sample were drawn from\nthe final dictionary for accuracy validation. Additionally, we conducted a\nsymptom frequency distribution analysis to compare our dictionary to a\npre-defined lexicon from previous research. Results: We identified 498,480\nunique symptom entity expressions from the tweets. Pre-processing reduces the\nnumber to 18,226. The final dictionary contains 38,175 unique expressions of\nsymptoms that can be mapped to 966 UMLS concepts (accuracy = 95%). Symptom\ndistribution analysis found that our dictionary detects more symptoms and is\neffective at identifying psychiatric disorders like anxiety and depression,\noften missed by pre-defined lexicons. Conclusion: This study advances public\nhealth research by implementing a novel, systematic pipeline for curating\nsymptom lexicons from social media data. The final lexicon's high accuracy,\nvalidated by medical professionals, underscores the potential of this\nmethodology to reliably interpret and categorize vast amounts of unstructured\nsocial media data into actionable medical insights across diverse linguistic\nand regional landscapes.\n","authors":["Yining Hua","Shixu Lin","Minghui Li","Yujie Zhang","Dinah Foer","Siwen Wang","Peilin Zhou","Li Zhou","Jie Yang"],"pdf_url":"https://arxiv.org/pdf/2306.16001v2.pdf","comment":"Updated full paper. Abstract presented at IEEE ICHI 2023 and AMIA\n Annual Symposium 2023"},{"id":"http://arxiv.org/abs/2312.11486v2","updated":"2024-01-22T19:57:27Z","published":"2023-11-30T11:49:33Z","title":"Preference and Concurrence Aware Bayesian Graph Neural Networks for\n Recommender Systems","summary":" Graph-based collaborative filtering methods have prevailing performance for\nrecommender systems since they can capture high-order information between users\nand items, in which the graphs are constructed from the observed user-item\ninteractions that might miss links or contain spurious positive interactions in\nindustrial scenarios. The Bayesian Graph Neural Network framework approaches\nthis issue with generative models for the interaction graphs. 
The critical\nproblem is to devise a proper family of graph generative models tailored to\nrecommender systems. We propose an efficient generative model that jointly\nconsiders the preferences of users, the concurrence of items and some important\ngraph structure information. Experiments on four popular benchmark datasets\ndemonstrate the effectiveness of our proposed graph generative methods for\nrecommender systems.\n","authors":["Hongjian Gu","Yaochen Hu","Yingxue Zhang"],"pdf_url":"https://arxiv.org/pdf/2312.11486v2.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2401.12217v1","updated":"2024-01-22T18:59:29Z","published":"2024-01-22T18:59:29Z","title":"Exploring Simple Open-Vocabulary Semantic Segmentation","summary":" Open-vocabulary semantic segmentation models aim to accurately assign a\nsemantic label to each pixel in an image from a set of arbitrary\nopen-vocabulary texts. In order to learn such pixel-level alignment, current\napproaches typically rely on a combination of (i) image-level VL model (e.g.\nCLIP), (ii) ground truth masks, and (iii) custom grouping encoders. In this\npaper, we introduce S-Seg, a novel model that can achieve surprisingly strong\nperformance without depending on any of the above elements. S-Seg leverages\npseudo-mask and language to train a MaskFormer, and can be easily trained from\npublicly available image-text datasets. Contrary to prior works, our model\ndirectly trains for pixel-level features and language alignment. Once trained,\nS-Seg generalizes well to multiple testing datasets without requiring\nfine-tuning. In addition, S-Seg has the extra benefits of scalability with data\nand consistently improvement when augmented with self-training. We believe that\nour simple yet effective approach will serve as a solid baseline for future\nresearch.\n","authors":["Zihang Lai"],"pdf_url":"https://arxiv.org/pdf/2401.12217v1.pdf","comment":"Code is available at: https://github.com/zlai0/S-Seg"},{"id":"http://arxiv.org/abs/2401.12216v1","updated":"2024-01-22T18:59:12Z","published":"2024-01-22T18:59:12Z","title":"Mitigating Covariate Shift in Misspecified Regression with Applications\n to Reinforcement Learning","summary":" A pervasive phenomenon in machine learning applications is distribution\nshift, where training and deployment conditions for a machine learning model\ndiffer. As distribution shift typically results in a degradation in\nperformance, much attention has been devoted to algorithmic interventions that\nmitigate these detrimental effects. In this paper, we study the effect of\ndistribution shift in the presence of model misspecification, specifically\nfocusing on $L_{\\infty}$-misspecified regression and adversarial covariate\nshift, where the regression target remains fixed while the covariate\ndistribution changes arbitrarily. We show that empirical risk minimization, or\nstandard least squares regression, can result in undesirable misspecification\namplification where the error due to misspecification is amplified by the\ndensity ratio between the training and testing distributions. 
As our main\nresult, we develop a new algorithm -- inspired by robust optimization\ntechniques -- that avoids this undesirable behavior, resulting in no\nmisspecification amplification while still obtaining optimal statistical rates.\nAs applications, we use this regression procedure to obtain new guarantees in\noffline and online reinforcement learning with misspecification and establish\nnew separations between previously studied structural conditions and notions of\ncoverage.\n","authors":["Philip Amortila","Tongyi Cao","Akshay Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2401.12216v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13507v2","updated":"2024-01-22T18:54:52Z","published":"2023-08-25T17:33:05Z","title":"Large Language Models Should Ask Clarifying Questions to Increase\n Confidence in Generated Code","summary":" Large language models (LLMs) have significantly improved the ability to\nperform tasks in the field of code generation. However, there is still a gap\nbetween LLMs being capable coders and being top-tier software engineers. Based\non the observation that toplevel software engineers often ask clarifying\nquestions to reduce ambiguity in both requirements and coding solutions, I\nargue that the same should be applied to LLMs for code generation tasks. By\nasking probing questions in various topics before generating the final code,\nthe challenges of programming with LLMs, such as unclear intent specification,\nlack of computational thinking, and undesired code quality, may be alleviated.\nThis, in turn, increases confidence in the generated code. In this work, I\nexplore how to leverage better communication skills to achieve greater\nconfidence in generated code. I propose a communication-centered process that\nuses an LLM-generated communicator to identify issues with high ambiguity or\nlow confidence in problem descriptions and generated code. I then ask\nclarifying questions to obtain responses from users for refining the code.\n","authors":["Jie JW Wu"],"pdf_url":"https://arxiv.org/pdf/2308.13507v2.pdf","comment":"6 pages, 2 figures, 1 table. Accepted and presented at the 7th Annual\n Symposium on Machine Programming (MAPS 2023 Workshop, see\n https://mapsworkshop.github.io/). Reference: \"Wu, Jie JW. Large Language\n Models Should Ask Clarifying Questions to Increase Confidence in Generated\n Code. The 7th Annual Symposium on Machine Programming (MAPS 23), December 3,\n 2023, San Francisco, CA, USA\""},{"id":"http://arxiv.org/abs/2401.03506v3","updated":"2024-01-22T18:53:36Z","published":"2024-01-07T14:54:57Z","title":"DiarizationLM: Speaker Diarization Post-Processing with Large Language\n Models","summary":" In this paper, we introduce DiarizationLM, a framework to leverage large\nlanguage models (LLM) to post-process the outputs from a speaker diarization\nsystem. Various goals can be achieved with the proposed framework, such as\nimproving the readability of the diarized transcript, or reducing the word\ndiarization error rate (WDER). In this framework, the outputs of the automatic\nspeech recognition (ASR) and speaker diarization systems are represented as a\ncompact textual format, which is included in the prompt to an optionally\nfinetuned LLM. The outputs of the LLM can be used as the refined diarization\nresults with the desired enhancement. As a post-processing step, this framework\ncan be easily applied to any off-the-shelf ASR and speaker diarization systems\nwithout retraining existing components. 
Our experiments show that a finetuned\nPaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone\nconversation dataset, and rel. 44.9% on the Callhome English dataset.\n","authors":["Quan Wang","Yiling Huang","Guanlong Zhao","Evan Clark","Wei Xia","Hank Liao"],"pdf_url":"https://arxiv.org/pdf/2401.03506v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12207v1","updated":"2024-01-22T18:49:56Z","published":"2024-01-22T18:49:56Z","title":"Rate-Distortion-Perception Tradeoff Based on the\n Conditional-Distribution Perception Measure","summary":" We study the rate-distortion-perception (RDP) tradeoff for a memoryless\nsource model in the asymptotic limit of large block-lengths. Our perception\nmeasure is based on a divergence between the distributions of the source and\nreconstruction sequences conditioned on the encoder output, which was first\nproposed in [1], [2]. We consider the case when there is no shared randomness\nbetween the encoder and the decoder. For the case of discrete memoryless\nsources we derive a single-letter characterization of the RDP function, thus\nsettling a problem that remains open for the marginal metric introduced in Blau\nand Michaeli [3] (with no shared randomness). Our achievability scheme is based\non lossy source coding with a posterior reference map proposed in [4]. For the\ncase of continuous valued sources under squared error distortion measure and\nsquared quadratic Wasserstein perception measure we also derive a single-letter\ncharacterization and show that a noise-adding mechanism at the decoder suffices\nto achieve the optimal representation. For the case of zero perception loss, we\nshow that our characterization interestingly coincides with the results for the\nmarginal metric derived in [5], [6] and again demonstrate that zero perception\nloss can be achieved with a $3$-dB penalty in the minimum distortion. Finally\nwe specialize our results to the case of Gaussian sources. We derive the RDP\nfunction for vector Gaussian sources and propose a waterfilling type solution.\nWe also partially characterize the RDP function for a mixture of vector\nGaussians.\n","authors":["Sadaf Salehkalaibar","Jun Chen","Ashish Khisti","Wei Yu"],"pdf_url":"https://arxiv.org/pdf/2401.12207v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12205v1","updated":"2024-01-22T18:46:30Z","published":"2024-01-22T18:46:30Z","title":"Retrieval-Guided Reinforcement Learning for Boolean Circuit Minimization","summary":" Logic synthesis, a pivotal stage in chip design, entails optimizing chip\nspecifications encoded in hardware description languages like Verilog into\nhighly efficient implementations using Boolean logic gates. The process\ninvolves a sequential application of logic minimization heuristics (``synthesis\nrecipe\"), with their arrangement significantly impacting crucial metrics such\nas area and delay. Addressing the challenge posed by the broad spectrum of\ndesign complexities - from variations of past designs (e.g., adders and\nmultipliers) to entirely novel configurations (e.g., innovative processor\ninstructions) - requires a nuanced `synthesis recipe` guided by human expertise\nand intuition. This study conducts a thorough examination of learning and\nsearch techniques for logic synthesis, unearthing a surprising revelation:\npre-trained agents, when confronted with entirely novel designs, may veer off\ncourse, detrimentally affecting the search trajectory. 
We present ABC-RL, a\nmeticulously tuned $\\alpha$ parameter that adeptly adjusts recommendations from\npre-trained agents during the search process. Computed based on similarity\nscores through nearest neighbor retrieval from the training dataset, ABC-RL\nyields superior synthesis recipes tailored for a wide array of hardware\ndesigns. Our findings showcase substantial enhancements in the\nQuality-of-result (QoR) of synthesized circuits, boasting improvements of up to\n24.8% compared to state-of-the-art techniques. Furthermore, ABC-RL achieves an\nimpressive up to 9x reduction in runtime (iso-QoR) when compared to current\nstate-of-the-art methodologies.\n","authors":["Animesh Basak Chowdhury","Marco Romanelli","Benjamin Tan","Ramesh Karri","Siddharth Garg"],"pdf_url":"https://arxiv.org/pdf/2401.12205v1.pdf","comment":"Accepted in ICLR 2024"},{"id":"http://arxiv.org/abs/2401.12202v1","updated":"2024-01-22T18:42:20Z","published":"2024-01-22T18:42:20Z","title":"OK-Robot: What Really Matters in Integrating Open-Knowledge Models for\n Robotics","summary":" Remarkable progress has been made in recent years in the fields of vision,\nlanguage, and robotics. We now have vision models capable of recognizing\nobjects based on language queries, navigation systems that can effectively\ncontrol mobile systems, and grasping models that can handle a wide range of\nobjects. Despite these advancements, general-purpose applications of robotics\nstill lag behind, even though they rely on these fundamental capabilities of\nrecognition, navigation, and grasping. In this paper, we adopt a systems-first\napproach to develop a new Open Knowledge-based robotics framework called\nOK-Robot. By combining Vision-Language Models (VLMs) for object detection,\nnavigation primitives for movement, and grasping primitives for object\nmanipulation, OK-Robot offers a integrated solution for pick-and-drop\noperations without requiring any training. To evaluate its performance, we run\nOK-Robot in 10 real-world home environments. The results demonstrate that\nOK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks,\nrepresenting a new state-of-the-art in Open Vocabulary Mobile Manipulation\n(OVMM) with nearly 1.8x the performance of prior work. On cleaner, uncluttered\nenvironments, OK-Robot's performance increases to 82%. However, the most\nimportant insight gained from OK-Robot is the critical role of nuanced details\nwhen combining Open Knowledge systems like VLMs with robotic modules. Videos of\nour experiments are available on our website: https://ok-robot.github.io\n","authors":["Peiqi Liu","Yaswanth Orru","Chris Paxton","Nur Muhammad Mahi Shafiullah","Lerrel Pinto"],"pdf_url":"https://arxiv.org/pdf/2401.12202v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12200v1","updated":"2024-01-22T18:39:40Z","published":"2024-01-22T18:39:40Z","title":"APT: Adaptive Pruning and Tuning Pretrained Language Models for\n Efficient Training and Inference","summary":" Fine-tuning and inference with large Language Models (LM) are generally known\nto be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces\ntraining memory by updating a small number of LM parameters but does not\nimprove inference efficiency. Structured pruning improves LM inference\nefficiency by removing consistent parameter blocks, yet often increases\ntraining memory and time. To improve both training and inference efficiency, we\nintroduce APT that adaptively prunes and tunes parameters for the LMs. 
At the\nearly stage of fine-tuning, APT dynamically adds salient tuning parameters for\nfast and accurate convergence while discarding unimportant parameters for\nefficiency. Compared to baselines, our experiments show that APT maintains up\nto 98% task performance when pruning RoBERTa and T5 models with 40% parameters\nleft while keeping 86.4% LLaMA models' performance with 70% parameters\nremained. Furthermore, APT speeds up LMs fine-tuning by up to 8x and reduces\nlarge LMs memory training footprint by up to 70%.\n","authors":["Bowen Zhao","Hannaneh Hajishirzi","Qingqing Cao"],"pdf_url":"https://arxiv.org/pdf/2401.12200v1.pdf","comment":"19 pages, 6 figures"},{"id":"http://arxiv.org/abs/2401.12187v1","updated":"2024-01-22T18:27:08Z","published":"2024-01-22T18:27:08Z","title":"WARM: On the Benefits of Weight Averaged Reward Models","summary":" Aligning large language models (LLMs) with human preferences through\nreinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit\nfailures in the reward model (RM) to achieve seemingly high rewards without\nmeeting the underlying objectives. We identify two primary challenges when\ndesigning RMs to mitigate reward hacking: distribution shifts during the RL\nprocess and inconsistencies in human preferences. As a solution, we propose\nWeight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then\naveraging them in the weight space. This strategy follows the observation that\nfine-tuned weights remain linearly mode connected when sharing the same\npre-training. By averaging weights, WARM improves efficiency compared to the\ntraditional ensembling of predictions, while improving reliability under\ndistribution shifts and robustness to preference inconsistencies. Our\nexperiments on summarization tasks, using best-of-N and RL methods, shows that\nWARM improves the overall quality and alignment of LLM predictions; for\nexample, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy\nRL fine-tuned with a single RM.\n","authors":["Alexandre Ramé","Nino Vieillard","Léonard Hussenot","Robert Dadashi","Geoffrey Cideron","Olivier Bachem","Johan Ferret"],"pdf_url":"https://arxiv.org/pdf/2401.12187v1.pdf","comment":"14 pages, 9 figures"},{"id":"http://arxiv.org/abs/2401.10305v2","updated":"2024-01-22T18:12:20Z","published":"2024-01-18T13:18:51Z","title":"Personality Trait Inference Via Mobile Phone Sensors: A Machine Learning\n Approach","summary":" This study provides evidence that personality can be reliably predicted from\nactivity data collected through mobile phone sensors. Employing a set of well\ninformed indicators calculable from accelerometer records and movement\npatterns, we were able to predict users' personality up to a 0.78 F1 score on a\ntwo class problem. Given the fast growing number of data collected from mobile\nphones, our novel personality indicators open the door to exciting avenues for\nfuture research in social sciences. Our results reveal distinct behavioral\npatterns that proved to be differentially predictive of big five personality\ntraits. They potentially enable cost effective, questionnaire free\ninvestigation of personality related questions at an unprecedented scale. We\nshow how a combination of rich behavioral data obtained with smartphone sensing\nand the use of machine learning techniques can help to advance personality\nresearch and can inform both practitioners and researchers about the different\nbehavioral patterns of personality. 
These findings have practical implications\nfor organizations harnessing mobile sensor data for personality assessment,\nguiding the refinement of more precise and efficient prediction models in the\nfuture.\n","authors":["Wun Yung Shaney Sze","Maryglen Pearl Herrero","Roger Garriga"],"pdf_url":"https://arxiv.org/pdf/2401.10305v2.pdf","comment":"9 pages, 5 figures"},{"id":"http://arxiv.org/abs/2401.12181v1","updated":"2024-01-22T18:11:01Z","published":"2024-01-22T18:11:01Z","title":"Universal Neurons in GPT2 Language Models","summary":" A basic question within the emerging field of mechanistic interpretability is\nthe degree to which neural networks learn the same underlying mechanisms. In\nother words, are neural mechanisms universal across different models? In this\nwork, we study the universality of individual neurons across GPT2 models\ntrained from different initial random seeds, motivated by the hypothesis that\nuniversal neurons are likely to be interpretable. In particular, we compute\npairwise correlations of neuron activations over 100 million tokens for every\nneuron pair across five different seeds and find that 1-5\\% of neurons are\nuniversal, that is, pairs of neurons which consistently activate on the same\ninputs. We then study these universal neurons in detail, finding that they\nusually have clear interpretations and taxonomize them into a small number of\nneuron families. We conclude by studying patterns in neuron weights to\nestablish several universal functional roles of neurons in simple circuits:\ndeactivating attention heads, changing the entropy of the next token\ndistribution, and predicting the next token to (not) be within a particular\nset.\n","authors":["Wes Gurnee","Theo Horsley","Zifan Carl Guo","Tara Rezaei Kheirkhah","Qinyi Sun","Will Hathaway","Neel Nanda","Dimitris Bertsimas"],"pdf_url":"https://arxiv.org/pdf/2401.12181v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12179v1","updated":"2024-01-22T18:10:10Z","published":"2024-01-22T18:10:10Z","title":"DITTO: Diffusion Inference-Time T-Optimization for Music Generation","summary":" We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose\nframe-work for controlling pre-trained text-to-music diffusion models at\ninference-time via optimizing initial noise latents. Our method can be used to\noptimize through any differentiable feature matching loss to achieve a target\n(stylized) output and leverages gradient checkpointing for memory efficiency.\nWe demonstrate a surprisingly wide-range of applications for music generation\nincluding inpainting, outpainting, and looping as well as intensity, melody,\nand musical structure control - all without ever fine-tuning the underlying\nmodel. When we compare our approach against related training, guidance, and\noptimization-based methods, we find DITTO achieves state-of-the-art performance\non nearly all tasks, including outperforming comparable approaches on\ncontrollability, audio quality, and computational efficiency, thus opening the\ndoor for high-quality, flexible, training-free control of diffusion models.\nSound examples can be found at https://DITTO-Music.github.io/web/.\n","authors":["Zachary Novack","Julian McAuley","Taylor Berg-Kirkpatrick","Nicholas J. 
Bryan"],"pdf_url":"https://arxiv.org/pdf/2401.12179v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.11359v2","updated":"2024-01-22T18:01:37Z","published":"2022-05-23T14:45:34Z","title":"Towards Size-Independent Generalization Bounds for Deep Operator Nets","summary":" In recent times machine learning methods have made significant advances in\nbecoming a useful tool for analyzing physical systems. A particularly active\narea in this theme has been \"physics-informed machine learning\" which focuses\non using neural nets for numerically solving differential equations. In this\nwork, we aim to advance the theory of measuring out-of-sample error while\ntraining DeepONets -- which is among the most versatile ways to solve PDE\nsystems in one-shot.\n Firstly, for a class of DeepONets, we prove a bound on their Rademacher\ncomplexity which does not explicitly scale with the width of the nets involved.\nSecondly, we use this to show how the Huber loss can be chosen so that for\nthese DeepONet classes generalization error bounds can be obtained that have no\nexplicit dependence on the size of the nets. We note that our theoretical\nresults apply to any PDE being targeted to be solved by DeepONets.\n","authors":["Pulkit Gopalani","Sayar Karmakar","Dibyakanti Kumar","Anirbit Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2205.11359v2.pdf","comment":"27 pages, 5 figures; Added theorem on generalization error indicating\n benefits of training DeepONets on the Huber loss and corresponding\n experiments"},{"id":"http://arxiv.org/abs/2401.12168v1","updated":"2024-01-22T18:01:01Z","published":"2024-01-22T18:01:01Z","title":"SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning\n Capabilities","summary":" Understanding and reasoning about spatial relationships is a fundamental\ncapability for Visual Question Answering (VQA) and robotics. While Vision\nLanguage Models (VLM) have demonstrated remarkable performance in certain VQA\nbenchmarks, they still lack capabilities in 3D spatial reasoning, such as\nrecognizing quantitative relationships of physical objects like distances or\nsize differences. We hypothesize that VLMs' limited spatial reasoning\ncapability is due to the lack of 3D spatial knowledge in training data and aim\nto solve this problem by training VLMs with Internet-scale spatial reasoning\ndata. To this end, we present a system to facilitate this approach. We first\ndevelop an automatic 3D spatial VQA data generation framework that scales up to\n2 billion VQA examples on 10 million real-world images. We then investigate\nvarious factors in the training recipe, including data quality, training\npipeline, and VLM architecture. Our work features the first internet-scale 3D\nspatial reasoning dataset in metric space. By training a VLM on such data, we\nsignificantly enhance its ability on both qualitative and quantitative spatial\nVQA. Finally, we demonstrate that this VLM unlocks novel downstream\napplications in chain-of-thought spatial reasoning and robotics due to its\nquantitative estimation capability. 
Project website:\nhttps://spatial-vlm.github.io/\n","authors":["Boyuan Chen","Zhuo Xu","Sean Kirmani","Brian Ichter","Danny Driess","Pete Florence","Dorsa Sadigh","Leonidas Guibas","Fei Xia"],"pdf_url":"https://arxiv.org/pdf/2401.12168v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.08573v2","updated":"2024-01-22T17:54:58Z","published":"2024-01-16T18:58:36Z","title":"Benchmarking the Robustness of Image Watermarks","summary":" This paper investigates the weaknesses of image watermarking techniques. We\npresent WAVES (Watermark Analysis Via Enhanced Stress-testing), a novel\nbenchmark for assessing watermark robustness, overcoming the limitations of\ncurrent evaluation methods.WAVES integrates detection and identification tasks,\nand establishes a standardized evaluation protocol comprised of a diverse range\nof stress tests. The attacks in WAVES range from traditional image distortions\nto advanced and novel variations of diffusive, and adversarial attacks. Our\nevaluation examines two pivotal dimensions: the degree of image quality\ndegradation and the efficacy of watermark detection after attacks. We develop a\nseries of Performance vs. Quality 2D plots, varying over several prominent\nimage similarity metrics, which are then aggregated in a heuristically novel\nmanner to paint an overall picture of watermark robustness and attack potency.\nOur comprehensive evaluation reveals previously undetected vulnerabilities of\nseveral modern watermarking algorithms. We envision WAVES as a toolkit for the\nfuture development of robust watermarking systems. The project is available at\nhttps://wavesbench.github.io/\n","authors":["Bang An","Mucong Ding","Tahseen Rabbani","Aakriti Agrawal","Yuancheng Xu","Chenghao Deng","Sicheng Zhu","Abdirisak Mohamed","Yuxin Wen","Tom Goldstein","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2401.08573v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12149v1","updated":"2024-01-22T17:36:23Z","published":"2024-01-22T17:36:23Z","title":"Personalized Over-the-Air Federated Learning with Personalized\n Reconfigurable Intelligent Surfaces","summary":" Over-the-air federated learning (OTA-FL) provides bandwidth-efficient\nlearning by leveraging the inherent superposition property of wireless\nchannels. Personalized federated learning balances performance for users with\ndiverse datasets, addressing real-life data heterogeneity. We propose the first\npersonalized OTA-FL scheme through multi-task learning, assisted by personal\nreconfigurable intelligent surfaces (RIS) for each user. We take a cross-layer\napproach that optimizes communication and computation resources for global and\npersonalized tasks in time-varying channels with imperfect channel state\ninformation, using multi-task learning for non-i.i.d data. Our PROAR-PFed\nalgorithm adaptively designs power, local iterations, and RIS configurations.\nWe present convergence analysis for non-convex objectives and demonstrate that\nPROAR-PFed outperforms state-of-the-art on the Fashion-MNIST dataset.\n","authors":["Jiayu Mao","Aylin Yener"],"pdf_url":"https://arxiv.org/pdf/2401.12149v1.pdf","comment":"Copyright 2024 IEEE. Published in ICASSP 2024, 14-19 April, Seoul,\n Korea. Personal use of this material is permitted. 
However, permission to\n reprint/republish this material for advertising or promotional purposes or\n for creating new collective works for resale or redistribution to servers or\n lists, or to reuse any copyrighted component of this work in other works,\n must be obtained from the IEEE"},{"id":"http://arxiv.org/abs/2401.12133v1","updated":"2024-01-22T17:15:02Z","published":"2024-01-22T17:15:02Z","title":"VRMN-bD: A Multi-modal Natural Behavior Dataset of Immersive Human Fear\n Responses in VR Stand-up Interactive Games","summary":" Understanding and recognizing emotions are important and challenging issues\nin the metaverse era. Understanding, identifying, and predicting fear, which is\none of the fundamental human emotions, in virtual reality (VR) environments\nplays an essential role in immersive game development, scene development, and\nnext-generation virtual human-computer interaction applications. In this\narticle, we used VR horror games as a medium to analyze fear emotions by\ncollecting multi-modal data (posture, audio, and physiological signals) from 23\nplayers. We used an LSTM-based model to predict fear with accuracies of 65.31%\nand 90.47% under 6-level classification (no fear and five different levels of\nfear) and 2-level classification (no fear and fear), respectively. We\nconstructed a multi-modal natural behavior dataset of immersive human fear\nresponses (VRMN-bD) and compared it with existing relevant advanced datasets.\nThe results show that our dataset has fewer limitations in terms of collection\nmethod, data scale and audience scope. We are unique and advanced in targeting\nmulti-modal datasets of fear and behavior in VR stand-up interactive\nenvironments. Moreover, we discussed the implications of this work for\ncommunities and applications. The dataset and pre-trained model are available\nat https://github.com/KindOPSTAR/VRMN-bD.\n","authors":["He Zhang","Xinyang Li","Yuanxi Sun","Xinyi Fu","Christine Qiu","John M. Carroll"],"pdf_url":"https://arxiv.org/pdf/2401.12133v1.pdf","comment":"Accepted to IEEE VR 2024"},{"id":"http://arxiv.org/abs/2401.12132v1","updated":"2024-01-22T17:14:47Z","published":"2024-01-22T17:14:47Z","title":"Evaluation of QCNN-LSTM for Disability Forecasting in Multiple Sclerosis\n Using Sequential Multisequence MRI","summary":" Introduction Quantum Convolutional Neural Network (QCNN)-Long Short-Term\nMemory (LSTM) models were studied to provide sequential relationships for each\ntimepoint in MRIs of patients with Multiple Sclerosis (MS). In this pilot\nstudy, we compared three QCNN-LSTM models for binary classification of MS\ndisability benchmarked against classical neural network architectures. Our\nhypothesis is that quantum models will provide competitive performance. Methods\nMatrix Product State (MPS), reverse Multistate Entanglement Renormalization\nAnsatz (MERA), and Tree-Tensor Network (TTN) circuits were paired with LSTM\nlayer to process near-annual MRI data of patients diagnosed with MS. These were\nbenchmarked against a Visual Geometry Group (VGG)-LSTM and a Video Vision\nTransformer (ViViT). Predicted logits were measured against ground truth labels\nof each patient's Extended Disability Severity Score (EDSS) using binary\ncross-entropy loss. Training/validation/holdout testing was partitioned using\n5-fold cross validation with a total split of 60:20:20. Levene's test of\nvariance was used to measure statistical difference and Student's t-test for\npaired model differences in mean. 
Results The MPS-LSTM, reverse MERA-LSTM, and\nTTN-LSTM had holdout testing ROC-AUC of 0.70, 0.77, and 0.81, respectively\n(p-value 0.915). VGG16-LSTM and ViViT performed similarly with ROC-AUC of 0.73\nand 0.77, respectively (p-value 0.631). Overall variance and mean were not\nstatistically significant (p-value 0.713), however, time to train was\nsignificantly faster for the QCNN-LSTMs (39.4 sec per fold vs. 224 and 218,\nrespectively, p-value <0.001). Conclusion QCNN-LSTM models perform\ncompetitively to their classical counterparts with greater efficiency in train\ntime. Clinically, these can add value in terms of efficiency to time-dependent\ndeep learning prediction of disease progression based upon medical imaging.\n","authors":["John D. Mayfield","Issam El Naqa"],"pdf_url":"https://arxiv.org/pdf/2401.12132v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12131v1","updated":"2024-01-22T17:13:50Z","published":"2024-01-22T17:13:50Z","title":"NeuroSynt: A Neuro-symbolic Portfolio Solver for Reactive Synthesis","summary":" We introduce NeuroSynt, a neuro-symbolic portfolio solver framework for\nreactive synthesis. At the core of the solver lies a seamless integration of\nneural and symbolic approaches to solving the reactive synthesis problem. To\nensure soundness, the neural engine is coupled with model checkers verifying\nthe predictions of the underlying neural models. The open-source implementation\nof NeuroSynt provides an integration framework for reactive synthesis in which\nnew neural and state-of-the-art symbolic approaches can be seamlessly\nintegrated. Extensive experiments demonstrate its efficacy in handling\nchallenging specifications, enhancing the state-of-the-art reactive synthesis\nsolvers, with NeuroSynt contributing novel solves in the current SYNTCOMP\nbenchmarks.\n","authors":["Matthias Cosler","Christopher Hahn","Ayham Omar","Frederik Schmitt"],"pdf_url":"https://arxiv.org/pdf/2401.12131v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.06144v2","updated":"2024-01-22T17:11:57Z","published":"2023-11-30T23:31:33Z","title":"DFU: scale-robust diffusion model for zero-shot super-resolution image\n generation","summary":" Diffusion generative models have achieved remarkable success in generating\nimages with a fixed resolution. However, existing models have limited ability\nto generalize to different resolutions when training data at those resolutions\nare not available. Leveraging techniques from operator learning, we present a\nnovel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the\nscore operator by combining both spatial and spectral information at multiple\nresolutions. Comparisons of DFU to baselines demonstrate its scalability: 1)\nsimultaneously training on multiple resolutions improves FID over training at\nany single fixed resolution; 2) DFU generalizes beyond its training\nresolutions, allowing for coherent, high-fidelity generation at\nhigher-resolutions with the same model, i.e. 
zero-shot super-resolution\nimage-generation; 3) we propose a fine-tuning strategy to further enhance the\nzero-shot super-resolution image-generation capability of our model, leading to\na FID of 11.3 at 1.66 times the maximum training resolution on FFHQ, which no\nother method can come close to achieving.\n","authors":["Alex Havrilla","Kevin Rojas","Wenjing Liao","Molei Tao"],"pdf_url":"https://arxiv.org/pdf/2401.06144v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12129v1","updated":"2024-01-22T17:11:01Z","published":"2024-01-22T17:11:01Z","title":"Out-of-Distribution Detection & Applications With Ablated Learned\n Temperature Energy","summary":" As deep neural networks become adopted in high-stakes domains, it is crucial\nto be able to identify when inference inputs are Out-of-Distribution (OOD) so\nthat users can be alerted of likely drops in performance and calibration\ndespite high confidence. Among many others, existing methods use the following\ntwo scores to do so without training on any apriori OOD examples: a learned\ntemperature and an energy score. In this paper we introduce Ablated Learned\nTemperature Energy (or \"AbeT\" for short), a method which combines these prior\nmethods in novel ways with effective modifications. Due to these contributions,\nAbeT lowers the False Positive Rate at $95\\%$ True Positive Rate (FPR@95) by\n$35.39\\%$ in classification (averaged across all ID and OOD datasets measured)\ncompared to state of the art without training networks in multiple stages or\nrequiring hyperparameters or test-time backward passes. We additionally provide\nempirical insights as to how our model learns to distinguish between\nIn-Distribution (ID) and OOD samples while only being explicitly trained on ID\nsamples via exposure to misclassified ID examples at training time. Lastly, we\nshow the efficacy of our method in identifying predicted bounding boxes and\npixels corresponding to OOD objects in object detection and semantic\nsegmentation, respectively - with an AUROC increase of $5.15\\%$ in object\ndetection and both a decrease in FPR@95 of $41.48\\%$ and an increase in AUPRC\nof $34.20\\%$ on average in semantic segmentation compared to previous state of\nthe art.\n","authors":["Will LeVine","Benjamin Pikus","Jacob Phillips","Berk Norman","Fernando Amat Gil","Sean Hendryx"],"pdf_url":"https://arxiv.org/pdf/2401.12129v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15462v2","updated":"2024-01-22T17:02:16Z","published":"2023-09-27T07:57:37Z","title":"DTC: Deep Tracking Control","summary":" Legged locomotion is a complex control problem that requires both accuracy\nand robustness to cope with real-world challenges. Legged systems have\ntraditionally been controlled using trajectory optimization with inverse\ndynamics. Such hierarchical model-based methods are appealing due to intuitive\ncost function tuning, accurate planning, generalization, and most importantly,\nthe insightful understanding gained from more than one decade of extensive\nresearch. However, model mismatch and violation of assumptions are common\nsources of faulty operation. Simulation-based reinforcement learning, on the\nother hand, results in locomotion policies with unprecedented robustness and\nrecovery skills. Yet, all learning algorithms struggle with sparse rewards\nemerging from environments where valid footholds are rare, such as gaps or\nstepping stones. 
In this work, we propose a hybrid control architecture that\ncombines the advantages of both worlds to simultaneously achieve greater\nrobustness, foot-placement accuracy, and terrain generalization. Our approach\nutilizes a model-based planner to roll out a reference motion during training.\nA deep neural network policy is trained in simulation, aiming to track the\noptimized footholds. We evaluate the accuracy of our locomotion pipeline on\nsparse terrains, where pure data-driven methods are prone to fail. Furthermore,\nwe demonstrate superior robustness in the presence of slippery or deformable\nground when compared to model-based counterparts. Finally, we show that our\nproposed tracking controller generalizes across different trajectory\noptimization methods not seen during training. In conclusion, our work unites\nthe predictive capabilities and optimality guarantees of online planning with\nthe inherent robustness attributed to offline learning.\n","authors":["Fabian Jenelten","Junzhe He","Farbod Farshidian","Marco Hutter"],"pdf_url":"https://arxiv.org/pdf/2309.15462v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12113v1","updated":"2024-01-22T16:51:01Z","published":"2024-01-22T16:51:01Z","title":"Extracting Formulae in Many-Valued Logic from Deep Neural Networks","summary":" We propose a new perspective on deep ReLU networks, namely as circuit\ncounterparts of Lukasiewicz infinite-valued logic -- a many-valued (MV)\ngeneralization of Boolean logic. An algorithm for extracting formulae in MV\nlogic from deep ReLU networks is presented. As the algorithm applies to\nnetworks with general, in particular also real-valued, weights, it can be used\nto extract logical formulae from deep ReLU networks trained on data.\n","authors":["Yani Zhang","Helmut Bölcskei"],"pdf_url":"https://arxiv.org/pdf/2401.12113v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12108v1","updated":"2024-01-22T16:45:15Z","published":"2024-01-22T16:45:15Z","title":"On-Time Delivery in Crowdshipping Systems: An Agent-Based Approach Using\n Streaming Data","summary":" In parcel delivery, the \"last mile\" from the parcel hub to the customer is\ncostly, especially for time-sensitive delivery tasks that have to be completed\nwithin hours after arrival. Recently, crowdshipping has attracted increased\nattention as a new alternative to traditional delivery modes. In crowdshipping,\nprivate citizens (\"the crowd\") perform short detours in their daily lives to\ncontribute to parcel delivery in exchange for small incentives. However,\nachieving desirable crowd behavior is challenging as the crowd is highly\ndynamic and consists of autonomous, self-interested individuals. Leveraging\ncrowdshipping for time-sensitive deliveries remains an open challenge. In this\npaper, we present an agent-based approach to on-time parcel delivery with\ncrowds. Our system performs data stream processing on the couriers' smartphone\nsensor data to predict delivery delays. Whenever a delay is predicted, the\nsystem attempts to forge an agreement for transferring the parcel from the\ncurrent deliverer to a more promising courier nearby. 
Our experiments show that\nthrough accurate delay predictions and purposeful task transfers many delays\ncan be prevented that would occur without our approach.\n","authors":["Jeremias Dötterl","Ralf Bruns","Jürgen Dunkel","Sascha Ossowski"],"pdf_url":"https://arxiv.org/pdf/2401.12108v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12103v1","updated":"2024-01-22T16:38:33Z","published":"2024-01-22T16:38:33Z","title":"LearnedWMP: Workload Memory Prediction Using Distribution of Query\n Templates","summary":" In a modern DBMS, working memory is frequently the limiting factor when\nprocessing in-memory analytic query operations such as joins, sorting, and\naggregation. Existing resource estimation approaches for a DBMS estimate the\nresource consumption of a query by computing an estimate of each individual\ndatabase operator in the query execution plan. Such an approach is slow and\nerror-prone as it relies upon simplifying assumptions, such as uniformity and\nindependence of the underlying data. Additionally, the existing approach\nfocuses on individual queries separately and does not factor in other queries\nin the workload that may be executed concurrently. In this research, we are\ninterested in query performance optimization under concurrent execution of a\nbatch of queries (a workload). Specifically, we focus on predicting the memory\ndemand for a workload rather than providing separate estimates for each query\nwithin it. We introduce the problem of workload memory prediction and formalize\nit as a distribution regression problem. We propose Learned Workload Memory\nPrediction (LearnedWMP) to improve and simplify estimating the working memory\ndemands of workloads. Through a comprehensive experimental evaluation, we show\nthat LearnedWMP reduces the memory estimation error of the\nstate-of-the-practice method by up to 47.6%. Compared to an alternative\nsingle-query model, during training and inferencing, the LearnedWMP model and\nits variants were 3x to 10x faster. Moreover, LearnedWMP-based models were at\nleast 50% smaller in most cases. Overall, the results demonstrate the\nadvantages of the LearnedWMP approach and its potential for a broader impact on\nquery performance optimization.\n","authors":["Shaikh Quader","Andres Jaramillo","Sumona Mukhopadhyay","Ghadeer Abuoda","Calisto Zuzarte","David Kalmuk","Marin Litoiu","Manos Papagelis"],"pdf_url":"https://arxiv.org/pdf/2401.12103v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17028v2","updated":"2024-01-22T16:25:13Z","published":"2023-05-26T15:36:59Z","title":"Better Batch for Deep Probabilistic Time Series Forecasting","summary":" Deep probabilistic time series forecasting has gained significant attention\ndue to its superior performance in nonlinear approximation and its ability to\nprovide valuable uncertainty quantification for decision-making tasks. However,\nmany existing models oversimplify the problem by assuming that the error\nprocess is time-independent, thereby overlooking the serial correlation in the\nerror process. To overcome this limitation, we propose an innovative training\nmethod that incorporates error autocorrelation to further enhance the accuracy\nof probabilistic forecasting. Our method involves constructing a mini-batch as\na collection of $D$ consecutive time series segments for model training and\nexplicitly learning a time-varying covariance matrix over each mini-batch that\nencodes the error correlation among adjacent time steps. 
The learned covariance\nmatrix can be used to improve prediction accuracy and enhance uncertainty\nquantification. We evaluate our method on two different neural forecasting\nmodels and multiple public datasets, and the experimental results confirm the\neffectiveness of the proposed approach in enhancing the performance of both\nmodels across a wide range of datasets, yielding notable improvements in\npredictive accuracy.\n","authors":["Vincent Zhihao Zheng","Seongjin Choi","Lijun Sun"],"pdf_url":"https://arxiv.org/pdf/2305.17028v2.pdf","comment":"9 pages, 3 figures, camera-ready version, The 27th International\n Conference on Artificial Intelligence and Statistics (AISTATS 2024)"},{"id":"http://arxiv.org/abs/2401.12086v1","updated":"2024-01-22T16:24:43Z","published":"2024-01-22T16:24:43Z","title":"West-of-N: Synthetic Preference Generation for Improved Reward Modeling","summary":" The success of reinforcement learning from human feedback (RLHF) in language\nmodel alignment is strongly dependent on the quality of the underlying reward\nmodel. In this paper, we present a novel approach to improve reward model\nquality by generating synthetic preference data, thereby augmenting the\ntraining dataset with on-policy, high-quality preference pairs. Motivated by\nthe promising results of Best-of-N sampling strategies in language model\ntraining, we extend their application to reward model training. This results in\na self-training strategy to generate preference pairs by selecting the best and\nworst candidates in a pool of responses to a given query. Empirically, we find\nthat this approach improves the performance of any reward model, with an effect\ncomparable to the addition of a similar quantity of human preference data. This\nwork opens up new avenues of research for improving RLHF for language model\nalignment, by offering synthetic preference generation as a solution to reward\nmodeling challenges.\n","authors":["Alizée Pace","Jonathan Mallinson","Eric Malmi","Sebastian Krause","Aliaksei Severyn"],"pdf_url":"https://arxiv.org/pdf/2401.12086v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12079v1","updated":"2024-01-22T16:21:19Z","published":"2024-01-22T16:21:19Z","title":"Collaborative Reinforcement Learning Based Unmanned Aerial Vehicle (UAV)\n Trajectory Design for 3D UAV Tracking","summary":" In this paper, the problem of using one active unmanned aerial vehicle (UAV)\nand four passive UAVs to localize a 3D target UAV in real time is investigated.\nIn the considered model, each passive UAV receives reflection signals from the\ntarget UAV, which are initially transmitted by the active UAV. The received\nreflection signals allow each passive UAV to estimate the signal transmission\ndistance which will be transmitted to a base station (BS) for the estimation of\nthe position of the target UAV. Due to the movement of the target UAV, each\nactive/passive UAV must optimize its trajectory to continuously localize the\ntarget UAV. Meanwhile, since the accuracy of the distance estimation depends on\nthe signal-to-noise ratio of the transmission signals, the active UAV must\noptimize its transmit power. This problem is formulated as an optimization\nproblem whose goal is to jointly optimize the transmit power of the active UAV\nand trajectories of both active and passive UAVs so as to maximize the target\nUAV positioning accuracy. To solve this problem, a Z function decomposition\nbased reinforcement learning (ZD-RL) method is proposed. 
Compared to value\nfunction decomposition based RL (VD-RL), the proposed method can find the\nprobability distribution of the sum of future rewards to accurately estimate\nthe expected value of the sum of future rewards thus finding better transmit\npower of the active UAV and trajectories for both active and passive UAVs and\nimproving target UAV positioning accuracy. Simulation results show that the\nproposed ZD-RL method can reduce the positioning errors by up to 39.4% and\n64.6%, compared to VD-RL and independent deep RL methods, respectively.\n","authors":["Yujiao Zhu","Mingzhe Chen","Sihua Wang","Ye Hu","Yuchen Liu","Changchuan Yin"],"pdf_url":"https://arxiv.org/pdf/2401.12079v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12070v1","updated":"2024-01-22T16:09:47Z","published":"2024-01-22T16:09:47Z","title":"Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated\n Text","summary":" Detecting text generated by modern large language models is thought to be\nhard, as both LLMs and humans can exhibit a wide range of complex behaviors.\nHowever, we find that a score based on contrasting two closely related language\nmodels is highly accurate at separating human-generated and machine-generated\ntext. Based on this mechanism, we propose a novel LLM detector that only\nrequires simple calculations using a pair of pre-trained LLMs. The method,\ncalled Binoculars, achieves state-of-the-art accuracy without any training\ndata. It is capable of spotting machine text from a range of modern LLMs\nwithout any model-specific modifications. We comprehensively evaluate\nBinoculars on a number of text sources and in varied situations. Over a wide\nrange of document types, Binoculars detects over 90% of generated samples from\nChatGPT (and other LLMs) at a false positive rate of 0.01%, despite not being\ntrained on any ChatGPT data.\n","authors":["Abhimanyu Hans","Avi Schwarzschild","Valeriia Cherepanova","Hamid Kazemi","Aniruddha Saha","Micah Goldblum","Jonas Geiping","Tom Goldstein"],"pdf_url":"https://arxiv.org/pdf/2401.12070v1.pdf","comment":"20 pages, code available at https://github.com/ahans30/Binoculars"},{"id":"http://arxiv.org/abs/2401.12069v1","updated":"2024-01-22T16:08:41Z","published":"2024-01-22T16:08:41Z","title":"Beyond TreeSHAP: Efficient Computation of Any-Order Shapley Interactions\n for Tree Ensembles","summary":" While shallow decision trees may be interpretable, larger ensemble models\nlike gradient-boosted trees, which often set the state of the art in machine\nlearning problems involving tabular data, still remain black box models. As a\nremedy, the Shapley value (SV) is a well-known concept in explainable\nartificial intelligence (XAI) research for quantifying additive feature\nattributions of predictions. The model-specific TreeSHAP methodology solves the\nexponential complexity for retrieving exact SVs from tree-based models.\nExpanding beyond individual feature attribution, Shapley interactions reveal\nthe impact of intricate feature interactions of any order. In this work, we\npresent TreeSHAP-IQ, an efficient method to compute any-order additive Shapley\ninteractions for predictions of tree-based models. TreeSHAP-IQ is supported by\na mathematical framework that exploits polynomial arithmetic to compute the\ninteraction scores in a single recursive traversal of the tree, akin to Linear\nTreeSHAP. 
We apply TreeSHAP-IQ on state-of-the-art tree ensembles and explore\ninteractions on well-established benchmark datasets.\n","authors":["Maximilian Muschalik","Fabian Fumagalli","Barbara Hammer","Eyke Hüllermeier"],"pdf_url":"https://arxiv.org/pdf/2401.12069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12068v1","updated":"2024-01-22T16:05:30Z","published":"2024-01-22T16:05:30Z","title":"Resource-constrained stereo singing voice cancellation","summary":" We study the problem of stereo singing voice cancellation, a subtask of music\nsource separation, whose goal is to estimate an instrumental background from a\nstereo mix. We explore how to achieve performance similar to large\nstate-of-the-art source separation networks starting from a small, efficient\nmodel for real-time speech separation. Such a model is useful when memory and\ncompute are limited and singing voice processing has to run with limited\nlook-ahead. In practice, this is realised by adapting an existing mono model to\nhandle stereo input. Improvements in quality are obtained by tuning model\nparameters and expanding the training set. Moreover, we highlight the benefits\na stereo model brings by introducing a new metric which detects attenuation\ninconsistencies between channels. Our approach is evaluated using objective\noffline metrics and a large-scale MUSHRA trial, confirming the effectiveness of\nour techniques in stringent listening tests.\n","authors":["Clara Borrelli","James Rae","Dogac Basaran","Matt McVicar","Mehrez Souden","Matthias Mauch"],"pdf_url":"https://arxiv.org/pdf/2401.12068v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12058v1","updated":"2024-01-22T15:50:32Z","published":"2024-01-22T15:50:32Z","title":"The Dimension Strikes Back with Gradients: Generalization of Gradient\n Methods in Stochastic Convex Optimization","summary":" We study the generalization performance of gradient methods in the\nfundamental stochastic convex optimization setting, focusing on its dimension\ndependence. First, for full-batch gradient descent (GD) we give a construction\nof a learning problem in dimension $d=O(n^2)$, where the canonical version of\nGD (tuned for optimal performance of the empirical risk) trained with $n$\ntraining examples converges, with constant probability, to an approximate\nempirical risk minimizer with $\\Omega(1)$ population excess risk. Our bound\ntranslates to a lower bound of $\\Omega (\\sqrt{d})$ on the number of training\nexamples required for standard GD to reach a non-trivial test error, answering\nan open question raised by Feldman (2016) and Amir, Koren, and Livni (2021b)\nand showing that a non-trivial dimension dependence is unavoidable.\nFurthermore, for standard one-pass stochastic gradient descent (SGD), we show\nthat an application of the same construction technique provides a similar\n$\\Omega(\\sqrt{d})$ lower bound for the sample complexity of SGD to reach a\nnon-trivial empirical error, despite achieving optimal test performance. 
This\nagain provides an exponential improvement in the dimension dependence compared\nto previous work (Koren, Livni, Mansour, and Sherman, 2022), resolving an open\nquestion left therein.\n","authors":["Matan Schliserman","Uri Sherman","Tomer Koren"],"pdf_url":"https://arxiv.org/pdf/2401.12058v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12055v1","updated":"2024-01-22T15:47:05Z","published":"2024-01-22T15:47:05Z","title":"NEUROSEC: FPGA-Based Neuromorphic Audio Security","summary":" Neuromorphic systems, inspired by the complexity and functionality of the\nhuman brain, have gained interest in academic and industrial attention due to\ntheir unparalleled potential across a wide range of applications. While their\ncapabilities herald innovation, it is imperative to underscore that these\ncomputational paradigms, analogous to their traditional counterparts, are not\nimpervious to security threats. Although the exploration of neuromorphic\nmethodologies for image and video processing has been rigorously pursued, the\nrealm of neuromorphic audio processing remains in its early stages. Our results\nhighlight the robustness and precision of our FPGA-based neuromorphic system.\nSpecifically, our system showcases a commendable balance between desired signal\nand background noise, efficient spike rate encoding, and unparalleled\nresilience against adversarial attacks such as FGSM and PGD. A standout feature\nof our framework is its detection rate of 94%, which, when compared to other\nmethodologies, underscores its greater capability in identifying and mitigating\nthreats within 5.39 dB, a commendable SNR ratio. Furthermore, neuromorphic\ncomputing and hardware security serve many sensor domains in mission-critical\nand privacy-preserving applications.\n","authors":["Murat Isik","Hiruna Vishwamith","Yusuf Sur","Kayode Inadagbo","I. Can Dikmen"],"pdf_url":"https://arxiv.org/pdf/2401.12055v1.pdf","comment":"Audio processing, FPGA, Hardware Security, Neuromorphic Computing"},{"id":"http://arxiv.org/abs/2401.12046v1","updated":"2024-01-22T15:38:29Z","published":"2024-01-22T15:38:29Z","title":"Fourier Transporter: Bi-Equivariant Robotic Manipulation in 3D","summary":" Many complex robotic manipulation tasks can be decomposed as a sequence of\npick and place actions. Training a robotic agent to learn this sequence over\nmany different starting conditions typically requires many iterations or\ndemonstrations, especially in 3D environments. In this work, we propose Fourier\nTransporter (\\ours{}) which leverages the two-fold $\\SE(d)\\times\\SE(d)$\nsymmetry in the pick-place problem to achieve much higher sample efficiency.\n\\ours{} is an open-loop behavior cloning method trained using expert\ndemonstrations to predict pick-place actions on new environments. \\ours{} is\nconstrained to incorporate symmetries of the pick and place actions\nindependently. Our method utilizes a fiber space Fourier transformation that\nallows for memory-efficient construction. 
We test our proposed network on the\nRLbench benchmark and achieve state-of-the-art results across various tasks.\n","authors":["Haojie Huang","Owen Howell","Xupeng Zhu","Dian Wang","Robin Walters","Robert Platt"],"pdf_url":"https://arxiv.org/pdf/2401.12046v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.08865v2","updated":"2024-01-22T15:30:08Z","published":"2024-01-16T22:36:23Z","title":"The Effect of Intrinsic Dataset Properties on Generalization: Unraveling\n Learning Differences Between Natural and Medical Images","summary":" This paper investigates discrepancies in how neural networks learn from\ndifferent imaging domains, which are commonly overlooked when adopting computer\nvision techniques from the domain of natural images to other specialized\ndomains such as medical images. Recent works have found that the generalization\nerror of a trained network typically increases with the intrinsic dimension\n($d_{data}$) of its training set. Yet, the steepness of this relationship\nvaries significantly between medical (radiological) and natural imaging\ndomains, with no existing theoretical explanation. We address this gap in\nknowledge by establishing and empirically validating a generalization scaling\nlaw with respect to $d_{data}$, and propose that the substantial scaling\ndiscrepancy between the two considered domains may be at least partially\nattributed to the higher intrinsic \"label sharpness\" ($K_F$) of medical imaging\ndatasets, a metric which we propose. Next, we demonstrate an additional benefit\nof measuring the label sharpness of a training set: it is negatively correlated\nwith the trained model's adversarial robustness, which notably leads to models\nfor medical images having a substantially higher vulnerability to adversarial\nattack. Finally, we extend our $d_{data}$ formalism to the related metric of\nlearned representation intrinsic dimension ($d_{repr}$), derive a\ngeneralization scaling law with respect to $d_{repr}$, and show that $d_{data}$\nserves as an upper bound for $d_{repr}$. Our theoretical results are supported\nby thorough experiments with six models and eleven natural and medical imaging\ndatasets over a range of training set sizes. Our findings offer insights into\nthe influence of intrinsic dataset properties on generalization, representation\nlearning, and robustness in deep neural networks.\n","authors":["Nicholas Konz","Maciej A. Mazurowski"],"pdf_url":"https://arxiv.org/pdf/2401.08865v2.pdf","comment":"ICLR 2024. Code:\n https://github.com/mazurowski-lab/intrinsic-properties"},{"id":"http://arxiv.org/abs/2401.12033v1","updated":"2024-01-22T15:19:18Z","published":"2024-01-22T15:19:18Z","title":"Momentum-SAM: Sharpness Aware Minimization without Computational\n Overhead","summary":" The recently proposed optimization algorithm for deep neural networks\nSharpness Aware Minimization (SAM) suggests perturbing parameters before\ngradient calculation by a gradient ascent step to guide the optimization into\nparameter space regions of flat loss. While significant generalization\nimprovements and thus reduction of overfitting could be demonstrated, the\ncomputational costs are doubled due to the additionally needed gradient\ncalculation, making SAM unfeasible in case of limited computationally\ncapacities. 
Motivated by Nesterov Accelerated Gradient (NAG) we propose\nMomentum-SAM (MSAM), which perturbs parameters in the direction of the\naccumulated momentum vector to achieve low sharpness without significant\ncomputational overhead or memory demands over SGD or Adam. We evaluate MSAM in\ndetail and reveal insights on separable mechanisms of NAG, SAM and MSAM\nregarding training optimization and generalization. Code is available at\nhttps://github.com/MarlonBecker/MSAM.\n","authors":["Marlon Becker","Frederick Altrock","Benjamin Risse"],"pdf_url":"https://arxiv.org/pdf/2401.12033v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12024v1","updated":"2024-01-22T15:11:57Z","published":"2024-01-22T15:11:57Z","title":"Multimodal Visual-Tactile Representation Learning through\n Self-Supervised Contrastive Pre-Training","summary":" The rapidly evolving field of robotics necessitates methods that can\nfacilitate the fusion of multiple modalities. Specifically, when it comes to\ninteracting with tangible objects, effectively combining visual and tactile\nsensory data is key to understanding and navigating the complex dynamics of the\nphysical world, enabling a more nuanced and adaptable response to changing\nenvironments. Nevertheless, much of the earlier work in merging these two\nsensory modalities has relied on supervised methods utilizing datasets labeled\nby humans.This paper introduces MViTac, a novel methodology that leverages\ncontrastive learning to integrate vision and touch sensations in a\nself-supervised fashion. By availing both sensory inputs, MViTac leverages\nintra and inter-modality losses for learning representations, resulting in\nenhanced material property classification and more adept grasping prediction.\nThrough a series of experiments, we showcase the effectiveness of our method\nand its superiority over existing state-of-the-art self-supervised and\nsupervised techniques. In evaluating our methodology, we focus on two distinct\ntasks: material classification and grasping success prediction. Our results\nindicate that MViTac facilitates the development of improved modality encoders,\nyielding more robust representations as evidenced by linear probing\nassessments.\n","authors":["Vedant Dave","Fotios Lygerakis","Elmar Rueckert"],"pdf_url":"https://arxiv.org/pdf/2401.12024v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13141v2","updated":"2024-01-22T15:07:26Z","published":"2023-12-20T16:02:25Z","title":"Augment on Manifold: Mixup Regularization with UMAP","summary":" Data augmentation techniques play an important role in enhancing the\nperformance of deep learning models. Despite their proven benefits in computer\nvision tasks, their application in the other domains remains limited. This\npaper proposes a Mixup regularization scheme, referred to as UMAP Mixup,\ndesigned for ``on-manifold\" automated data augmentation for deep learning\npredictive models. The proposed approach ensures that the Mixup operations\nresult in synthesized samples that lie on the data manifold of the features and\nlabels by utilizing a dimensionality reduction technique known as uniform\nmanifold approximation and projection. 
Evaluations across diverse regression\ntasks show that UMAP Mixup is competitive with or outperforms other Mixup\nvariants, show promise for its potential as an effective tool for enhancing the\ngeneralization performance of deep learning models.\n","authors":["Yousef El-Laham","Elizabeth Fons","Dillon Daudert","Svitlana Vyetrenko"],"pdf_url":"https://arxiv.org/pdf/2312.13141v2.pdf","comment":"accepted paper to be published in the proceedings of ICASSP 2024"},{"id":"http://arxiv.org/abs/2311.14212v3","updated":"2024-01-22T15:05:30Z","published":"2023-11-23T21:54:22Z","title":"Annotation Sensitivity: Training Data Collection Methods Affect Model\n Performance","summary":" When training data are collected from human annotators, the design of the\nannotation instrument, the instructions given to annotators, the\ncharacteristics of the annotators, and their interactions can impact training\ndata. This study demonstrates that design choices made when creating an\nannotation instrument also impact the models trained on the resulting\nannotations. We introduce the term annotation sensitivity to refer to the\nimpact of annotation data collection methods on the annotations themselves and\non downstream model performance and predictions. We collect annotations of hate\nspeech and offensive language in five experimental conditions of an annotation\ninstrument, randomly assigning annotators to conditions. We then fine-tune BERT\nmodels on each of the five resulting datasets and evaluate model performance on\na holdout portion of each condition. We find considerable differences between\nthe conditions for 1) the share of hate speech/offensive language annotations,\n2) model performance, 3) model predictions, and 4) model learning curves. Our\nresults emphasize the crucial role played by the annotation instrument which\nhas received little attention in the machine learning literature. We call for\nadditional research into how and why the instrument impacts the annotations to\ninform the development of best practices in instrument design.\n","authors":["Christoph Kern","Stephanie Eckman","Jacob Beck","Rob Chew","Bolei Ma","Frauke Kreuter"],"pdf_url":"https://arxiv.org/pdf/2311.14212v3.pdf","comment":"EMNLP 2023 Findings:\n https://aclanthology.org/2023.findings-emnlp.992/"},{"id":"http://arxiv.org/abs/2312.13152v2","updated":"2024-01-22T15:04:57Z","published":"2023-12-20T16:16:29Z","title":"Neural Stochastic Differential Equations with Change Points: A\n Generative Adversarial Approach","summary":" Stochastic differential equations (SDEs) have been widely used to model real\nworld random phenomena. Existing works mainly focus on the case where the time\nseries is modeled by a single SDE, which might be restrictive for modeling time\nseries with distributional shift. In this work, we propose a change point\ndetection algorithm for time series modeled as neural SDEs. Given a time series\ndataset, the proposed method jointly learns the unknown change points and the\nparameters of distinct neural SDE models corresponding to each change point.\nSpecifically, the SDEs are learned under the framework of generative\nadversarial networks (GANs) and the change points are detected based on the\noutput of the GAN discriminator in a forward pass. At each step of the proposed\nalgorithm, the change points and the SDE model parameters are updated in an\nalternating fashion. 
Numerical results on both synthetic and real datasets are\nprovided to validate the performance of our algorithm in comparison to\nclassical change point detection benchmarks, standard GAN-based neural SDEs,\nand other state-of-the-art deep generative models for time series data.\n","authors":["Zhongchang Sun","Yousef El-Laham","Svitlana Vyetrenko"],"pdf_url":"https://arxiv.org/pdf/2312.13152v2.pdf","comment":"accepted paper to be published in the proceedings of ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.12014v1","updated":"2024-01-22T15:00:32Z","published":"2024-01-22T15:00:32Z","title":"Robustness to distribution shifts of compressed networks for edge\n devices","summary":" It is necessary to develop efficient DNNs deployed on edge devices with\nlimited computation resources. However, the compressed networks often execute\nnew tasks in the target domain, which is different from the source domain where\nthe original network is trained. It is important to investigate the robustness\nof compressed networks in two types of data distribution shifts: domain shifts\nand adversarial perturbations. In this study, we discover that compressed\nmodels are less robust to distribution shifts than their original networks.\nInterestingly, larger networks are more vulnerable to losing robustness than\nsmaller ones, even when they are compressed to a similar size as the smaller\nnetworks. Furthermore, compact networks obtained by knowledge distillation are\nmuch more robust to distribution shifts than pruned networks. Finally,\npost-training quantization is a reliable method for achieving significant\nrobustness to distribution shifts, and it outperforms both pruned and distilled\nmodels in terms of robustness.\n","authors":["Lulan Shen","Ali Edalati","Brett Meyer","Warren Gross","James J. Clark"],"pdf_url":"https://arxiv.org/pdf/2401.12014v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12012v1","updated":"2024-01-22T14:59:11Z","published":"2024-01-22T14:59:11Z","title":"TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for\n Lazy Clients","summary":" Federated learning is a distributed collaborative machine learning paradigm\nthat has gained strong momentum in recent years. In federated learning, a\ncentral server periodically coordinates models with clients and aggregates the\nmodels trained locally by clients without necessitating access to local data.\nDespite its potential, the implementation of federated learning continues to\nencounter several challenges, predominantly the slow convergence that is\nlargely due to data heterogeneity. The slow convergence becomes particularly\nproblematic in cross-device federated learning scenarios where clients may be\nstrongly limited by computing power and storage space, and hence counteracting\nmethods that induce additional computation or memory cost on the client side\nsuch as auxiliary objective terms and larger training iterations can be\nimpractical. In this paper, we propose a novel federated aggregation strategy,\nTurboSVM-FL, that poses no additional computation burden on the client side and\ncan significantly accelerate convergence for federated classification task,\nespecially when clients are \"lazy\" and train their models solely for few epochs\nfor next global aggregation. TurboSVM-FL extensively utilizes support vector\nmachine to conduct selective aggregation and max-margin spread-out\nregularization on class embeddings. 
We evaluate TurboSVM-FL on multiple\ndatasets including FEMNIST, CelebA, and Shakespeare using user-independent\nvalidation with non-iid data distribution. Our results show that TurboSVM-FL\ncan significantly outperform existing popular algorithms on convergence rate\nand reduce communication rounds while delivering better test metrics including\naccuracy, F1 score, and MCC.\n","authors":["Mengdi Wang","Anna Bodonhelyi","Efe Bozkir","Enkelejda Kasneci"],"pdf_url":"https://arxiv.org/pdf/2401.12012v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12007v1","updated":"2024-01-22T14:55:01Z","published":"2024-01-22T14:55:01Z","title":"Tensor-view Topological Graph Neural Network","summary":" Graph classification is an important learning task for graph-structured data.\nGraph neural networks (GNNs) have recently gained growing attention in graph\nlearning and have shown significant improvements in many important graph\nproblems. Despite their state-of-the-art performances, existing GNNs only use\nlocal information from a very limited neighborhood around each node, suffering\nfrom loss of multi-modal information and overheads of excessive computation. To\naddress these issues, we propose a novel Tensor-view Topological Graph Neural\nNetwork (TTG-NN), a class of simple yet effective topological deep learning\nbuilt upon persistent homology, graph convolution, and tensor operations. This\nnew method incorporates tensor learning to simultaneously capture Tensor-view\nTopological (TT), as well as Tensor-view Graph (TG) structural information on\nboth local and global levels. Computationally, to fully exploit graph topology\nand structure, we propose two flexible TT and TG representation learning\nmodules that disentangle feature tensor aggregation and transformation and\nlearn to preserve multi-modal structure with less computation. Theoretically,\nwe derive high probability bounds on both the out-of-sample and in-sample mean\nsquared approximation errors for our proposed Tensor Transformation Layer\n(TTL). Real data experiments show that the proposed TTG-NN outperforms 20\nstate-of-the-art methods on various graph benchmarks.\n","authors":["Tao Wen","Elynn Chen","Yuzhou Chen"],"pdf_url":"https://arxiv.org/pdf/2401.12007v1.pdf","comment":"Accepted at AISTATS 2024"},{"id":"http://arxiv.org/abs/2309.12701v2","updated":"2024-01-22T14:53:22Z","published":"2023-09-22T08:18:08Z","title":"Decision Tree Search as a Markov Decision Problem","summary":" Finding an optimal decision tree for a supervised learning task is a\nchallenging combinatorial problem to solve at scale. It was recently proposed\nto frame the problem as a Markov Decision Problem (MDP) and use deep\nreinforcement learning to tackle scaling. Unfortunately, these methods are not\ncompetitive with the current branch-and-bound state-of-the-art. We propose\ninstead to scale the resolution of such MDPs using an information-theoretic\ntests generating function that heuristically, and dynamically for every state,\nlimits the set of admissible test actions to a few good candidates. As a\nsolver, we show empirically that our algorithm is at the very least competitive\nwith branch-and-bound alternatives. As a machine learning tool, a key advantage\nof our approach is to solve for multiple complexity-performance trade-offs at\nvirtually no additional cost. 
With such a set of solutions, a user can then\nselect the tree that generalizes best and which has the interpretability level\nthat best suits their needs, which no current branch-and-bound method allows.\n","authors":["Hector Kohler","Riad Akrour","Philippe Preux"],"pdf_url":"https://arxiv.org/pdf/2309.12701v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12004v1","updated":"2024-01-22T14:53:21Z","published":"2024-01-22T14:53:21Z","title":"NLCG-Net: A Model-Based Zero-Shot Learning Framework for Undersampled\n Quantitative MRI Reconstruction","summary":" Typical quantitative MRI (qMRI) methods estimate parameter maps after image\nreconstructing, which is prone to biases and error propagation. We propose a\nNonlinear Conjugate Gradient (NLCG) optimizer for model-based T2/T1 estimation,\nwhich incorporates U-Net regularization trained in a scan-specific manner. This\nend-to-end method directly estimates qMRI maps from undersampled k-space data\nusing mono-exponential signal modeling with zero-shot scan-specific neural\nnetwork regularization to enable high fidelity T1 and T2 mapping. T2 and T1\nmapping results demonstrate the ability of the proposed NLCG-Net to improve\nestimation quality compared to subspace reconstruction at high accelerations.\n","authors":["Xinrui Jiang","Yohan Jun","Jaejin Cho","Mengze Gao","Xingwang Yong","Berkin Bilgic"],"pdf_url":"https://arxiv.org/pdf/2401.12004v1.pdf","comment":"8 pages, 5 figures, submitted to International Society for Magnetic\n Resonance in Medicine 2024"},{"id":"http://arxiv.org/abs/2401.12002v1","updated":"2024-01-22T14:52:34Z","published":"2024-01-22T14:52:34Z","title":"HgbNet: predicting hemoglobin level/anemia degree from EHR data","summary":" Anemia is a prevalent medical condition that typically requires invasive\nblood tests for diagnosis and monitoring. Electronic health records (EHRs) have\nemerged as valuable data sources for numerous medical studies. EHR-based\nhemoglobin level/anemia degree prediction is non-invasive and rapid but still\nfaces some challenges due to the fact that EHR data is typically an irregular\nmultivariate time series containing a significant number of missing values and\nirregular time intervals. To address these issues, we introduce HgbNet, a\nmachine learning-based prediction model that emulates clinicians'\ndecision-making processes for hemoglobin level/anemia degree prediction. The\nmodel incorporates a NanDense layer with a missing indicator to handle missing\nvalues and employs attention mechanisms to account for both local irregularity\nand global irregularity. We evaluate the proposed method using two real-world\ndatasets across two use cases. In our first use case, we predict hemoglobin\nlevel/anemia degree at moment T+1 by utilizing records from moments prior to\nT+1. In our second use case, we integrate all historical records with\nadditional selected test results at moment T+1 to predict hemoglobin\nlevel/anemia degree at the same moment, T+1. HgbNet outperforms the best\nbaseline results across all datasets and use cases. These findings demonstrate\nthe feasibility of estimating hemoglobin levels and anemia degree from EHR\ndata, positioning HgbNet as an effective non-invasive anemia diagnosis solution\nthat could potentially enhance the quality of life for millions of affected\nindividuals worldwide. 
To our knowledge, HgbNet is the first machine learning\nmodel leveraging EHR data for hemoglobin level/anemia degree prediction.\n","authors":["Zhuo Zhi","Moe Elbadawi","Adam Daneshmend","Mine Orlu","Abdul Basit","Andreas Demosthenous","Miguel Rodrigues"],"pdf_url":"https://arxiv.org/pdf/2401.12002v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12000v1","updated":"2024-01-22T14:51:01Z","published":"2024-01-22T14:51:01Z","title":"Integrating Statistical Significance and Discriminative Power in Pattern\n Discovery","summary":" Pattern discovery plays a central role in both descriptive and predictive\ntasks across multiple domains. Actionable patterns must meet rigorous\nstatistical significance criteria and, in the presence of target variables,\nfurther uphold discriminative power. Our work addresses the underexplored area\nof guiding pattern discovery by integrating statistical significance and\ndiscriminative power criteria into state-of-the-art algorithms while preserving\npattern quality. We also address how pattern quality thresholds, imposed by\nsome algorithms, can be rectified to accommodate these additional criteria. To\ntest the proposed methodology, we select the triclustering task as the guiding\npattern discovery case and extend well-known greedy and multi-objective\noptimization triclustering algorithms, $\\delta$-Trimax and TriGen, that use\nvarious pattern quality criteria, such as Mean Squared Residual (MSR), Least\nSquared Lines (LSL), and Multi Slope Measure (MSL). Results from three case\nstudies show the role of the proposed methodology in discovering patterns with\npronounced improvements of discriminative power and statistical significance\nwithout quality deterioration, highlighting its importance in supervisedly\nguiding the search. Although the proposed methodology is motivated over\nmultivariate time series data, it can be straightforwardly extended to pattern\ndiscovery tasks involving multivariate, N-way (N>3), transactional, and\nsequential data structures.\n Availability: The code is freely available at\nhttps://github.com/JupitersMight/MOF_Triclustering under the MIT license.\n","authors":["Leonardo Alexandre","Rafael S. Costa","Rui Henriques"],"pdf_url":"https://arxiv.org/pdf/2401.12000v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11993v1","updated":"2024-01-22T14:46:41Z","published":"2024-01-22T14:46:41Z","title":"Expert-Driven Monitoring of Operational ML Models","summary":" We propose Expert Monitoring, an approach that leverages domain expertise to\nenhance the detection and mitigation of concept drift in machine learning (ML)\nmodels. Our approach supports practitioners by consolidating domain expertise\nrelated to concept drift-inducing events, making this expertise accessible to\non-call personnel, and enabling automatic adaptability with expert oversight.\n","authors":["Joran Leest","Claudia Raibulet","Ilias Gerostathopoulos","Patricia Lago"],"pdf_url":"https://arxiv.org/pdf/2401.11993v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11985v1","updated":"2024-01-22T14:38:25Z","published":"2024-01-22T14:38:25Z","title":"Scaling Face Interaction Graph Networks to Real World Scenes","summary":" Accurately simulating real world object dynamics is essential for various\napplications such as robotics, engineering, graphics, and design. To better\ncapture complex real dynamics such as contact and friction, learned simulators\nbased on graph networks have recently shown great promise. 
However, applying\nthese learned simulators to real scenes comes with two major challenges: first,\nscaling learned simulators to handle the complexity of real world scenes which\ncan involve hundreds of objects each with complicated 3D shapes, and second,\nhandling inputs from perception rather than 3D state information. Here we\nintroduce a method which substantially reduces the memory required to run\ngraph-based learned simulators. Based on this memory-efficient simulation\nmodel, we then present a perceptual interface in the form of editable NeRFs\nwhich can convert real-world scenes into a structured representation that can\nbe processed by graph network simulator. We show that our method uses\nsubstantially less memory than previous graph-based simulators while retaining\ntheir accuracy, and that the simulators learned in synthetic environments can\nbe applied to real world scenes captured from multiple camera angles. This\npaves the way for expanding the application of learned simulators to settings\nwhere only perceptual information is available at inference time.\n","authors":["Tatiana Lopez-Guevara","Yulia Rubanova","William F. Whitney","Tobias Pfaff","Kimberly Stachenfeld","Kelsey R. Allen"],"pdf_url":"https://arxiv.org/pdf/2401.11985v1.pdf","comment":"16 pages, 12 figures"},{"id":"http://arxiv.org/abs/2401.11974v1","updated":"2024-01-22T14:26:02Z","published":"2024-01-22T14:26:02Z","title":"Cross-Validation Conformal Risk Control","summary":" Conformal risk control (CRC) is a recently proposed technique that applies\npost-hoc to a conventional point predictor to provide calibration guarantees.\nGeneralizing conformal prediction (CP), with CRC, calibration is ensured for a\nset predictor that is extracted from the point predictor to control a risk\nfunction such as the probability of miscoverage or the false negative rate. The\noriginal CRC requires the available data set to be split between training and\nvalidation data sets. This can be problematic when data availability is\nlimited, resulting in inefficient set predictors. In this paper, a novel CRC\nmethod is introduced that is based on cross-validation, rather than on\nvalidation as the original CRC. The proposed cross-validation CRC (CV-CRC)\nextends a version of the jackknife-minmax from CP to CRC, allowing for the\ncontrol of a broader range of risk functions. CV-CRC is proved to offer\ntheoretical guarantees on the average risk of the set predictor. Furthermore,\nnumerical experiments show that CV-CRC can reduce the average set size with\nrespect to CRC when the available data are limited.\n","authors":["Kfir M. Cohen","Sangwoo Park","Osvaldo Simeone","Shlomo Shamai"],"pdf_url":"https://arxiv.org/pdf/2401.11974v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.08738v2","updated":"2024-01-22T14:17:27Z","published":"2024-01-16T18:31:23Z","title":"Machine Learning-Based Analysis of Ebola Virus' Impact on Gene\n Expression in Nonhuman Primates","summary":" This study introduces the Supervised Magnitude-Altitude Scoring (SMAS)\nmethodology, a machine learning-based approach, for analyzing gene expression\ndata obtained from nonhuman primates (NHPs) infected with Ebola virus (EBOV).\nWe utilize a comprehensive dataset of NanoString gene expression profiles from\nEbola-infected NHPs, deploying the SMAS system for nuanced host-pathogen\ninteraction analysis. 
SMAS effectively combines gene selection based on\nstatistical significance and expression changes, employing linear classifiers\nsuch as logistic regression to accurately differentiate between RT-qPCR\npositive and negative NHP samples. A key finding of our research is the\nidentification of IFI6 and IFI27 as critical biomarkers, demonstrating\nexceptional predictive performance with 100% accuracy and Area Under the Curve\n(AUC) metrics in classifying various stages of Ebola infection. Alongside IFI6\nand IFI27, genes, including MX1, OAS1, and ISG15, were significantly\nupregulated, highlighting their essential roles in the immune response to EBOV.\nOur results underscore the efficacy of the SMAS method in revealing complex\ngenetic interactions and response mechanisms during EBOV infection. This\nresearch provides valuable insights into EBOV pathogenesis and aids in\ndeveloping more precise diagnostic tools and therapeutic strategies to address\nEBOV infection in particular and viral infection in general.\n","authors":["Mostafa Rezapour","Muhammad Khalid Khan Niazi","Hao Lu","Aarthi Narayanan","Metin Nafi Gurcan"],"pdf_url":"https://arxiv.org/pdf/2401.08738v2.pdf","comment":"28 pages, 8 figures, 2 tables"},{"id":"http://arxiv.org/abs/2401.10451v2","updated":"2024-01-22T14:14:16Z","published":"2024-01-19T01:40:58Z","title":"Learning-assisted Stochastic Capacity Expansion Planning: A Bayesian\n Optimization Approach","summary":" Solving large-scale capacity expansion problems (CEPs) is central to\ncost-effective decarbonization of regional-scale energy systems. To ensure the\nintended outcomes of CEPs, modeling uncertainty due to weather-dependent\nvariable renewable energy (VRE) supply and energy demand becomes crucially\nimportant. However, the resulting stochastic optimization models are often less\ncomputationally tractable than their deterministic counterparts. Here, we\npropose a learning-assisted approximate solution method to tractably solve\ntwo-stage stochastic CEPs. Our method identifies low-cost planning decisions by\nconstructing and solving a sequence of tractable temporally aggregated\nsurrogate problems. We adopt a Bayesian optimization approach to searching the\nspace of time series aggregation hyperparameters and compute approximate\nsolutions that minimize costs on a validation set of supply-demand projections.\nImportantly, we evaluate solved planning outcomes on a held-out set of test\nprojections. We apply our approach to generation and transmission expansion\nplanning for a joint power-gas system spanning New England. We show that our\napproach yields an estimated cost savings of up to 3.8% in comparison to\nbenchmark time series aggregation approaches.\n","authors":["Aron Brenner","Rahman Khorramfar","Dharik Mallapragada","Saurabh Amin"],"pdf_url":"https://arxiv.org/pdf/2401.10451v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11963v1","updated":"2024-01-22T14:06:37Z","published":"2024-01-22T14:06:37Z","title":"Bridging Evolutionary Algorithms and Reinforcement Learning: A\n Comprehensive Survey","summary":" Evolutionary Reinforcement Learning (ERL), which integrates Evolutionary\nAlgorithms (EAs) and Reinforcement Learning (RL) for optimization, has\ndemonstrated remarkable performance advancements. By fusing the strengths of\nboth approaches, ERL has emerged as a promising research direction. 
This survey\noffers a comprehensive overview of the diverse research branches in ERL.\nSpecifically, we systematically summarize recent advancements in relevant\nalgorithms and identify three primary research directions: EA-assisted\noptimization of RL, RL-assisted optimization of EA, and synergistic\noptimization of EA and RL. Following that, we conduct an in-depth analysis of\neach research direction, organizing multiple research branches. We elucidate\nthe problems that each branch aims to tackle and how the integration of EA and\nRL addresses these challenges. In conclusion, we discuss potential challenges\nand prospective future research directions across various research directions.\n","authors":["Pengyi Li","Jianye Hao","Hongyao Tang","Xian Fu","Yan Zheng","Ke Tang"],"pdf_url":"https://arxiv.org/pdf/2401.11963v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11954v1","updated":"2024-01-22T13:54:26Z","published":"2024-01-22T13:54:26Z","title":"RUMBoost: Gradient Boosted Random Utility Models","summary":" This paper introduces the RUMBoost model, a novel discrete choice modelling\napproach that combines the interpretability and behavioural robustness of\nRandom Utility Models (RUMs) with the generalisation and predictive ability of\ndeep learning methods. We obtain the full functional form of non-linear utility\nspecifications by replacing each linear parameter in the utility functions of a\nRUM with an ensemble of gradient boosted regression trees. This enables\npiece-wise constant utility values to be imputed for all alternatives directly\nfrom the data for any possible combination of input variables. We introduce\nadditional constraints on the ensembles to ensure three crucial features of the\nutility specifications: (i) dependency of the utilities of each alternative on\nonly the attributes of that alternative, (ii) monotonicity of marginal\nutilities, and (iii) an intrinsically interpretable functional form, where the\nexact response of the model is known throughout the entire input space.\nFurthermore, we introduce an optimisation-based smoothing technique that\nreplaces the piece-wise constant utility values of alternative attributes with\nmonotonic piece-wise cubic splines to identify non-linear parameters with\ndefined gradient. We demonstrate the potential of the RUMBoost model compared\nto various ML and Random Utility benchmark models for revealed preference mode\nchoice data from London. The results highlight the great predictive performance\nand the direct interpretability of our proposed approach. Furthermore, the\nsmoothed attribute utility functions allow for the calculation of various\nbehavioural indicators and marginal utilities. Finally, we demonstrate the\nflexibility of our methodology by showing how the RUMBoost model can be\nextended to complex model specifications, including attribute interactions,\ncorrelation within alternative error terms and heterogeneity within the\npopulation.\n","authors":["Nicolas Salvadé","Tim Hillel"],"pdf_url":"https://arxiv.org/pdf/2401.11954v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2109.04033v4","updated":"2024-01-22T13:53:09Z","published":"2021-09-09T04:48:54Z","title":"New Versions of Gradient Temporal Difference Learning","summary":" Sutton, Szepesv\\'{a}ri and Maei introduced the first gradient\ntemporal-difference (GTD) learning algorithms compatible with both linear\nfunction approximation and off-policy training. 
The goal of this paper is (a)\nto propose some variants of GTDs with extensive comparative analysis and (b) to\nestablish new theoretical analysis frameworks for the GTDs. These variants are\nbased on convex-concave saddle-point interpretations of GTDs, which effectively\nunify all the GTDs into a single framework, and provide simple stability\nanalysis based on recent results on primal-dual gradient dynamics. Finally,\nnumerical comparative analysis is given to evaluate these approaches.\n","authors":["Donghwan Lee","Han-Dong Lim","Jihoon Park","Okyong Choi"],"pdf_url":"https://arxiv.org/pdf/2109.04033v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.03242v2","updated":"2024-01-22T13:40:16Z","published":"2023-11-06T16:31:09Z","title":"Approximating Langevin Monte Carlo with ResNet-like Neural Network\n architectures","summary":" We sample from a given target distribution by constructing a neural network\nwhich maps samples from a simple reference, e.g. the standard normal\ndistribution, to samples from the target. To that end, we propose using a\nneural network architecture inspired by the Langevin Monte Carlo (LMC)\nalgorithm. Based on LMC perturbation results, we show approximation rates of\nthe proposed architecture for smooth, log-concave target distributions measured\nin the Wasserstein-$2$ distance. The analysis heavily relies on the notion of\nsub-Gaussianity of the intermediate measures of the perturbed LMC process. In\nparticular, we derive bounds on the growth of the intermediate variance proxies\nunder different assumptions on the perturbations. Moreover, we propose an\narchitecture similar to deep residual neural networks and derive expressivity\nresults for approximating the sample to target distribution map.\n","authors":["Charles Miranda","Janina Schütte","David Sommer","Martin Eigel"],"pdf_url":"https://arxiv.org/pdf/2311.03242v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10107v2","updated":"2024-01-22T13:36:12Z","published":"2024-01-18T16:18:18Z","title":"Comparison analysis between standard polysomnographic data and\n in-ear-EEG signals: A preliminary study","summary":" Study Objectives: Polysomnography (PSG) currently serves as the benchmark for\nevaluating sleep disorders. Its discomfort, impracticality for home-use, and\nintroduction of bias in sleep quality assessment necessitate the exploration of\nless invasive, cost-effective, and portable alternatives. One promising\ncontender is the in-ear-EEG sensor, which offers advantages in terms of\ncomfort, fixed electrode positions, resistance to electromagnetic interference,\nand user-friendliness. This study aims to establish a methodology to assess the\nsimilarity between the in-ear-EEG signal and standard PSG.\n Methods: We assess the agreement between the PSG and in-ear-EEG derived\nhypnograms. We extract features in the time- and frequency- domain from PSG and\nin-ear-EEG 30-second epochs. We only consider the epochs where the PSG-scorers\nand the in-ear-EEG-scorers were in agreement. We introduce a methodology to\nquantify the similarity between PSG derivations and the single-channel\nin-ear-EEG. The approach relies on a comparison of distributions of selected\nfeatures -- extracted for each sleep stage and subject on both PSG and the\nin-ear-EEG signals -- via a Jensen-Shannon Divergence Feature-based Similarity\nIndex (JSD-FSI).\n Results: We found a high intra-scorer variability, mainly due to the\nuncertainty the scorers had in evaluating the in-ear-EEG signals. 
We show that\nthe similarity between PSG and in-ear-EEG signals is high (JSD-FSI: 0.61 +/-\n0.06 in awake, 0.60 +/- 0.07 in NREM and 0.51 +/- 0.08 in REM), and in line\nwith the similarity values computed independently on standard\nPSG-channel-combinations.\n Conclusions: In-ear-EEG is a valuable solution for home-based sleep\nmonitoring, however further studies with a larger and more heterogeneous\ndataset are needed.\n","authors":["Gianpaolo Palo","Luigi Fiorillo","Giuliana Monachino","Michal Bechny","Mark Melnykowycz","Athina Tzovara","Valentina Agostini","Francesca Dalia Faraci"],"pdf_url":"https://arxiv.org/pdf/2401.10107v2.pdf","comment":"29 pages, 12 figures, 1 table"},{"id":"http://arxiv.org/abs/2401.11943v1","updated":"2024-01-22T13:33:53Z","published":"2024-01-22T13:33:53Z","title":"Benchmarking Large Multimodal Models against Common Corruptions","summary":" This technical report aims to fill a deficiency in the assessment of large\nmultimodal models (LMMs) by specifically examining the self-consistency of\ntheir outputs when subjected to common corruptions. We investigate the\ncross-modal interactions between text, image, and speech, encompassing four\nessential generation tasks: text-to-image, image-to-text, text-to-speech, and\nspeech-to-text. We create a comprehensive benchmark, named MMCBench, that\ncovers more than 100 popular LMMs (totally over 150 model checkpoints). A\nthorough evaluation under common corruptions is critical for practical\ndeployment and facilitates a better understanding of the reliability of\ncutting-edge LMMs. The benchmarking code is available at\nhttps://github.com/sail-sg/MMCBench\n","authors":["Jiawei Zhang","Tianyu Pang","Chao Du","Yi Ren","Bo Li","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2401.11943v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2401.11940v1","updated":"2024-01-22T13:30:11Z","published":"2024-01-22T13:30:11Z","title":"Low-Tubal-Rank Tensor Recovery via Factorized Gradient Descent","summary":" This paper considers the problem of recovering a tensor with an underlying\nlow-tubal-rank structure from a small number of corrupted linear measurements.\nTraditional approaches tackling such a problem require the computation of\ntensor Singular Value Decomposition (t-SVD), that is a computationally\nintensive process, rendering them impractical for dealing with large-scale\ntensors. Aim to address this challenge, we propose an efficient and effective\nlow-tubal-rank tensor recovery method based on a factorization procedure akin\nto the Burer-Monteiro (BM) method. Precisely, our fundamental approach involves\ndecomposing a large tensor into two smaller factor tensors, followed by solving\nthe problem through factorized gradient descent (FGD). This strategy eliminates\nthe need for t-SVD computation, thereby reducing computational costs and\nstorage requirements. We provide rigorous theoretical analysis to ensure the\nconvergence of FGD under both noise-free and noisy situations. Additionally, it\nis worth noting that our method does not require the precise estimation of the\ntensor tubal-rank. Even in cases where the tubal-rank is slightly\noverestimated, our approach continues to demonstrate robust performance. 
A\nseries of experiments have been carried out to demonstrate that, as compared to\nother popular ones, our approach exhibits superior performance in multiple\nscenarios, in terms of the faster computational speed and the smaller\nconvergence error.\n","authors":["Zhiyu Liu","Zhi Han","Yandong Tang","Xi-Le Zhao","Yao Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11940v1.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2401.11929v1","updated":"2024-01-22T13:15:40Z","published":"2024-01-22T13:15:40Z","title":"The Bigger the Better? Rethinking the Effective Model Scale in Long-term\n Time Series Forecasting","summary":" Long-term time series forecasting (LTSF) represents a critical frontier in\ntime series analysis, distinguished by its focus on extensive input sequences,\nin contrast to the constrained lengths typical of traditional approaches. While\nlonger sequences inherently convey richer information, potentially enhancing\npredictive precision, prevailing techniques often respond by escalating model\ncomplexity. These intricate models can inflate into millions of parameters,\nincorporating parameter-intensive elements like positional encodings,\nfeed-forward networks and self-attention mechanisms. This complexity, however,\nleads to prohibitive model scale, particularly given the time series data's\nsemantic simplicity. Motivated by the pursuit of parsimony, our research\nemploys conditional correlation and auto-correlation as investigative tools,\nrevealing significant redundancies within the input data. Leveraging these\ninsights, we introduce the HDformer, a lightweight Transformer variant enhanced\nwith hierarchical decomposition. This novel architecture not only inverts the\nprevailing trend toward model expansion but also accomplishes precise\nforecasting with drastically fewer computations and parameters. Remarkably,\nHDformer outperforms existing state-of-the-art LTSF models, while requiring\nover 99\\% fewer parameters. Through this work, we advocate a paradigm shift in\nLTSF, emphasizing the importance to tailor the model to the inherent dynamics\nof time series data-a timely reminder that in the realm of LTSF, bigger is not\ninvariably better.\n","authors":["Jinliang Deng","Xuan Song","Ivor W. Tsang","Hui Xiong"],"pdf_url":"https://arxiv.org/pdf/2401.11929v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09126v2","updated":"2024-01-22T13:14:33Z","published":"2023-10-13T14:14:43Z","title":"Physics-guided Noise Neural Proxy for Practical Low-light Raw Image\n Denoising","summary":" Recently, the mainstream practice for training low-light raw image denoising\nmethods has shifted towards employing synthetic data. Noise modeling, which\nfocuses on characterizing the noise distribution of real-world sensors,\nprofoundly influences the effectiveness and practicality of synthetic data.\nCurrently, physics-based noise modeling struggles to characterize the entire\nreal noise distribution, while learning-based noise modeling impractically\ndepends on paired real data. In this paper, we propose a novel strategy:\nlearning the noise model from dark frames instead of paired real data, to break\ndown the data dependency. Based on this strategy, we introduce an efficient\nphysics-guided noise neural proxy (PNNP) to approximate the real-world sensor\nnoise model. 
Specifically, we integrate physical priors into neural proxies and\nintroduce three efficient techniques: physics-guided noise decoupling (PND),\nphysics-guided proxy model (PPM), and differentiable distribution loss (DDL).\nPND decouples the dark frame into different components and handles different\nlevels of noise flexibly, which reduces the complexity of noise modeling. PPM\nincorporates physical priors to constrain the generated noise, which promotes\nthe accuracy of noise modeling. DDL provides explicit and reliable supervision\nfor noise distribution, which promotes the precision of noise modeling. PNNP\nexhibits powerful potential in characterizing the real noise distribution.\nExtensive experiments on public datasets demonstrate superior performance in\npractical low-light raw image denoising. The code will be available at\n\\url{https://github.com/fenghansen/PNNP}.\n","authors":["Hansen Feng","Lizhi Wang","Yiqi Huang","Yuzhi Wang","Lin Zhu","Hua Huang"],"pdf_url":"https://arxiv.org/pdf/2310.09126v2.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2401.10337v2","updated":"2024-01-22T12:33:43Z","published":"2024-01-18T19:02:00Z","title":"Noise Contrastive Estimation-based Matching Framework for Low-resource\n Security Attack Pattern Recognition","summary":" Tactics, Techniques and Procedures (TTPs) represent sophisticated attack\npatterns in the cybersecurity domain, described encyclopedically in textual\nknowledge bases. Identifying TTPs in cybersecurity writing, often called TTP\nmapping, is an important and challenging task. Conventional learning approaches\noften target the problem in the classical multi-class or multilabel\nclassification setting. This setting hinders the learning ability of the model\ndue to a large number of classes (i.e., TTPs), the inevitable skewness of the\nlabel distribution and the complex hierarchical structure of the label space.\nWe formulate the problem in a different learning paradigm, where the assignment\nof a text to a TTP label is decided by the direct semantic similarity between\nthe two, thus reducing the complexity of competing solely over the large\nlabeling space. To that end, we propose a neural matching architecture with an\neffective sampling-based learn-to-compare mechanism, facilitating the learning\nprocess of the matching model despite constrained resources.\n","authors":["Tu Nguyen","Nedim Srndic","Alexander Neth"],"pdf_url":"https://arxiv.org/pdf/2401.10337v2.pdf","comment":"accepted at EACL 2024, in ARR October 2023"},{"id":"http://arxiv.org/abs/2401.11888v1","updated":"2024-01-22T12:28:50Z","published":"2024-01-22T12:28:50Z","title":"Multimodal Deep Learning of Word-of-Mouth Text and Demographics to\n Predict Customer Rating: Handling Consumer Heterogeneity in Marketing","summary":" In the marketing field, understanding consumer heterogeneity, which is the\ninternal or psychological difference among consumers that cannot be captured by\nbehavioral logs, has long been a critical challenge. However, a number of\nconsumers today usually post their evaluation on the specific product on the\nonline platform, which can be the valuable source of such unobservable\ndifferences among consumers. Several previous studies have shown the validity\nof the analysis on text modality, but on the other hand, such analyses may not\nnecessarily demonstrate sufficient predictive accuracy for text alone, as they\nmay not include information readily available from cross-sectional data, such\nas consumer profile data. 
In addition, recent advances in machine learning\ntechniques, such as large-scale language models (LLMs) and multimodal learning\nhave made it possible to deal with the various kind of dataset simultaneously,\nincluding textual data and the traditional cross-sectional data, and the joint\nrepresentations can be effectively obtained from multiple modalities.\nTherefore, this study constructs a product evaluation model that takes into\naccount consumer heterogeneity by multimodal learning of online product reviews\nand consumer profile information. We also compare multiple models using\ndifferent modalities or hyper-parameters to demonstrate the robustness of\nmultimodal learning in marketing analysis.\n","authors":["Junichiro Niimi"],"pdf_url":"https://arxiv.org/pdf/2401.11888v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.15269v3","updated":"2024-01-22T12:26:44Z","published":"2022-06-30T13:20:48Z","title":"Deep Reinforcement Learning with Swin Transformers","summary":" Transformers are neural network models that utilize multiple layers of\nself-attention heads and have exhibited enormous potential in natural language\nprocessing tasks. Meanwhile, there have been efforts to adapt transformers to\nvisual tasks of machine learning, including Vision Transformers and Swin\nTransformers. Although some researchers use Vision Transformers for\nreinforcement learning tasks, their experiments remain at a small scale due to\nthe high computational cost. This article presents the first online\nreinforcement learning scheme that is based on Swin Transformers: Swin DQN. In\ncontrast to existing research, our novel approach demonstrate the superior\nperformance with experiments on 49 games in the Arcade Learning Environment.\nThe results show that our approach achieves significantly higher maximal\nevaluation scores than the baseline method in 45 of all the 49 games (92%), and\nhigher mean evaluation scores than the baseline method in 40 of all the 49\ngames (82%).\n","authors":["Li Meng","Morten Goodwin","Anis Yazidi","Paal Engelstad"],"pdf_url":"https://arxiv.org/pdf/2206.15269v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10393v2","updated":"2024-01-22T12:04:18Z","published":"2024-01-18T22:06:38Z","title":"Catastrophic Interference is Mitigated in Naturalistic Power-Law\n Learning Environments","summary":" Neural networks often suffer from catastrophic interference (CI): performance\non previously learned tasks drops off significantly when learning a new task.\nThis contrasts strongly with humans, who can sequentially learn new tasks\nwithout appreciably forgetting previous tasks. Prior work has explored various\ntechniques for mitigating CI such as regularization, rehearsal, generative\nreplay, and distillation methods. The current work takes a different approach,\none guided by cognitive science research showing that in naturalistic\nenvironments, the probability of encountering a task decreases as a power-law\nof the time since it was last performed. We argue that a realistic evaluation\nof techniques for the mitigation of CI should be performed in simulated\nnaturalistic learning environments. Thus, we evaluate the extent of mitigation\nof CI when training simple rehearsal-based methods in power-law environments\nsimilar to the ones humans face. Our work explores this novel rehearsal-based\napproach for a domain-incremental task: learning permutations in the MNIST\ntask. 
We compare our rehearsal environment with other baselines to show its\nefficacy in promoting continual learning. Additionally, we investigate whether\nthis environment shows forward facilitation, i.e., faster learning of later\ntasks. Next, we explore the robustness of our learning environment to the\nnumber of tasks, model size, and amount of data rehearsed after each task.\nNotably, our results show that the performance is comparable or superior to\nthat of models trained using popular regularization methods and also to\nrehearsals in non-power-law environments. The benefits of this training\nparadigm include simplicity and the lack of a need for extra neural circuitry.\nIn addition, because our method is orthogonal to other methods, future research\ncan combine training in power-law environments with other continual learning\nmechanisms.\n","authors":["Atith Gandhi","Raj Sanjay Shah","Vijay Marupudi","Sashank Varma"],"pdf_url":"https://arxiv.org/pdf/2401.10393v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.06064v4","updated":"2024-01-22T12:04:06Z","published":"2023-05-18T13:59:02Z","title":"Neural Algorithmic Reasoning for Combinatorial Optimisation","summary":" Solving NP-hard/complete combinatorial problems with neural networks is a\nchallenging research area that aims to surpass classical approximate\nalgorithms. The long-term objective is to outperform hand-designed heuristics\nfor NP-hard/complete problems by learning to generate superior solutions solely\nfrom training data. Current neural-based methods for solving CO problems often\noverlook the inherent \"algorithmic\" nature of the problems. In contrast,\nheuristics designed for CO problems, e.g. TSP, frequently leverage\nwell-established algorithms, such as those for finding the minimum spanning\ntree. In this paper, we propose leveraging recent advancements in neural\nalgorithmic reasoning to improve the learning of CO problems. Specifically, we\nsuggest pre-training our neural model on relevant algorithms before training it\non CO instances. Our results demonstrate that by using this learning setup, we\nachieve superior performance compared to non-algorithmically informed deep\nlearning models.\n","authors":["Dobrik Georgiev","Danilo Numeroso","Davide Bacciu","Pietro Liò"],"pdf_url":"https://arxiv.org/pdf/2306.06064v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.04073v2","updated":"2024-01-22T12:00:58Z","published":"2023-05-06T15:26:22Z","title":"Explaining RL Decisions with Trajectories","summary":" Explanation is a key component for the adoption of reinforcement learning\n(RL) in many real-world decision-making problems. In the literature, the\nexplanation is often provided by saliency attribution to the features of the RL\nagent's state. In this work, we propose a complementary approach to these\nexplanations, particularly for offline RL, where we attribute the policy\ndecisions of a trained RL agent to the trajectories encountered by it during\ntraining. To do so, we encode trajectories in offline training data\nindividually as well as collectively (encoding a set of trajectories). We then\nattribute policy decisions to a set of trajectories in this encoded space by\nestimating the sensitivity of the decision with respect to that set. 
Further,\nwe demonstrate the effectiveness of the proposed approach in terms of quality\nof attributions as well as practical scalability in diverse environments that\ninvolve both discrete and continuous state and action spaces such as\ngrid-worlds, video games (Atari) and continuous control (MuJoCo). We also\nconduct a human study on a simple navigation task to observe how their\nunderstanding of the task compares with data attributed for a trained RL\npolicy. Keywords -- Explainable AI, Verifiability of AI Decisions, Explainable\nRL.\n","authors":["Shripad Vilasrao Deshmukh","Arpan Dasgupta","Balaji Krishnamurthy","Nan Jiang","Chirag Agarwal","Georgios Theocharous","Jayakumar Subramanian"],"pdf_url":"https://arxiv.org/pdf/2305.04073v2.pdf","comment":"Published at International Conference on Learning Representations\n (ICLR), 2023"},{"id":"http://arxiv.org/abs/2210.00108v3","updated":"2024-01-22T11:51:29Z","published":"2022-09-30T21:59:24Z","title":"ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled\n neural networks","summary":" Early backdoor attacks against machine learning set off an arms race in\nattack and defence development. Defences have since appeared demonstrating some\nability to detect backdoors in models or even remove them. These defences work\nby inspecting the training data, the model, or the integrity of the training\nprocedure. In this work, we show that backdoors can be added during\ncompilation, circumventing any safeguards in the data preparation and model\ntraining stages. The attacker can not only insert existing weight-based\nbackdoors during compilation, but also a new class of weight-independent\nbackdoors, such as ImpNet. These backdoors are impossible to detect during the\ntraining or data preparation processes, because they are not yet present. Next,\nwe demonstrate that some backdoors, including ImpNet, can only be reliably\ndetected at the stage where they are inserted and removing them anywhere else\npresents a significant challenge. We conclude that ML model security requires\nassurance of provenance along the entire technical pipeline, including the\ndata, model architecture, compiler, and hardware specification.\n","authors":["Tim Clifford","Ilia Shumailov","Yiren Zhao","Ross Anderson","Robert Mullins"],"pdf_url":"https://arxiv.org/pdf/2210.00108v3.pdf","comment":"10 pages, 7 figures, to be published in IEEE Secure and Trustworthy\n Machine Learning 2024. For website see https://ml.backdoors.uk . For source\n code, see https://git.sr.ht/~tim-clifford/impnet_source"},{"id":"http://arxiv.org/abs/2401.11860v1","updated":"2024-01-22T11:29:44Z","published":"2024-01-22T11:29:44Z","title":"A Review of Physics-Informed Machine Learning Methods with Applications\n to Condition Monitoring and Anomaly Detection","summary":" This study presents a comprehensive overview of PIML techniques in the\ncontext of condition monitoring. The central concept driving PIML is the\nincorporation of known physical laws and constraints into machine learning\nalgorithms, enabling them to learn from available data while remaining\nconsistent with physical principles. Through fusing domain knowledge with\ndata-driven learning, PIML methods offer enhanced accuracy and interpretability\nin comparison to purely data-driven approaches. 
In this comprehensive survey,\ndetailed examinations are performed with regard to the methodology by which\nknown physical principles are integrated within machine learning frameworks, as\nwell as their suitability for specific tasks within condition monitoring.\nIncorporation of physical knowledge into the ML model may be realized in a\nvariety of methods, with each having its unique advantages and drawbacks. The\ndistinct advantages and limitations of each methodology for the integration of\nphysics within data-driven models are detailed, considering factors such as\ncomputational efficiency, model interpretability, and generalizability to\ndifferent systems in condition monitoring and fault detection. Several case\nstudies and works of literature utilizing this emerging concept are presented\nto demonstrate the efficacy of PIML in condition monitoring applications. From\nthe literature reviewed, the versatility and potential of PIML in condition\nmonitoring may be demonstrated. Novel PIML methods offer an innovative solution\nfor addressing the complexities of condition monitoring and associated\nchallenges. This comprehensive survey helps form the foundation for future work\nin the field. As the technology continues to advance, PIML is expected to play\na crucial role in enhancing maintenance strategies, system reliability, and\noverall operational efficiency in engineering systems.\n","authors":["Yuandi Wu","Brett Sicard","Stephen Andrew Gadsden"],"pdf_url":"https://arxiv.org/pdf/2401.11860v1.pdf","comment":"Paper has been submitted for review to the journal Expert Systems\n with Applications (December 31, 2023). 90 pages, 22 figures, 9 tables"},{"id":"http://arxiv.org/abs/2309.16034v2","updated":"2024-01-22T11:26:35Z","published":"2023-09-27T21:26:01Z","title":"Analytical Modelling of Raw Data for Flow-Guided In-body Nanoscale\n Localization","summary":" Advancements in nanotechnology and material science are paving the way toward\nnanoscale devices that combine sensing, computing, data and energy storage, and\nwireless communication. In precision medicine, these nanodevices show promise\nfor disease diagnostics, treatment, and monitoring from within the patients'\nbloodstreams. Assigning the location of a sensed biological event with the\nevent itself, which is the main proposition of flow-guided in-body nanoscale\nlocalization, would be immensely beneficial from the perspective of precision\nmedicine. The nanoscale nature of the nanodevices and the challenging\nenvironment that the bloodstream represents, result in current flow-guided\nlocalization approaches being constrained in their communication and\nenergy-related capabilities. The communication and energy constraints of the\nnanodevices result in different features of raw data for flow-guided\nlocalization, in turn affecting its performance. An analytical modeling of the\neffects of imperfect communication and constrained energy causing intermittent\noperation of the nanodevices on the raw data produced by the nanodevices would\nbe beneficial. Hence, we propose an analytical model of raw data for\nflow-guided localization, where the raw data is modeled as a function of\ncommunication and energy-related capabilities of the nanodevice. We evaluate\nthe model by comparing its output with the one obtained through the utilization\nof a simulator for objective evaluation of flow-guided localization, featuring\ncomparably higher level of realism. 
Our results across a number of scenarios\nand heterogeneous performance metrics indicate high similarity between the\nmodel and simulator-generated raw datasets.\n","authors":["Guillem Pascual","Filip Lemic","Carmen Delgado","Xavier Costa-Perez"],"pdf_url":"https://arxiv.org/pdf/2309.16034v2.pdf","comment":"6 pages, 7 figures, 4 tables, 16 references"},{"id":"http://arxiv.org/abs/2309.10688v3","updated":"2024-01-22T11:26:17Z","published":"2023-09-19T15:23:07Z","title":"On the different regimes of Stochastic Gradient Descent","summary":" Modern deep networks are trained with stochastic gradient descent (SGD) whose\nkey hyperparameters are the number of data considered at each step or batch\nsize $B$, and the step size or learning rate $\\eta$. For small $B$ and large\n$\\eta$, SGD corresponds to a stochastic evolution of the parameters, whose\nnoise amplitude is governed by the `temperature' $T\\equiv \\eta/B$. Yet this\ndescription is observed to break down for sufficiently large batches $B\\geq\nB^*$, or simplifies to gradient descent (GD) when the temperature is\nsufficiently small. Understanding where these cross-overs take place remains a\ncentral challenge. Here, we resolve these questions for a teacher-student\nperceptron classification model and show empirically that our key predictions\nstill apply to deep networks. Specifically, we obtain a phase diagram in the\n$B$-$\\eta$ plane that separates three dynamical phases: \\textit{(i)} a\nnoise-dominated SGD governed by temperature, \\textit{(ii)} a\nlarge-first-step-dominated SGD and \\textit{(iii)} GD. These different phases\nalso correspond to different regimes of generalization error. Remarkably, our\nanalysis reveals that the batch size $B^*$ separating regimes \\textit{(i)} and\n\\textit{(ii)} scale with the size $P$ of the training set, with an exponent\nthat characterizes the hardness of the classification problem.\n","authors":["Antonio Sclocchi","Matthieu Wyart"],"pdf_url":"https://arxiv.org/pdf/2309.10688v3.pdf","comment":"Main: 8 pages, 4 figures; Appendix: 20 pages, 10 figures"},{"id":"http://arxiv.org/abs/2308.09647v2","updated":"2024-01-22T11:14:39Z","published":"2023-08-18T16:07:01Z","title":"Robust Uncertainty Quantification Using Conformalised Monte Carlo\n Prediction","summary":" Deploying deep learning models in safety-critical applications remains a very\nchallenging task, mandating the provision of assurances for the dependable\noperation of these models. Uncertainty quantification (UQ) methods estimate the\nmodel's confidence per prediction, informing decision-making by considering the\neffect of randomness and model misspecification. Despite the advances of\nstate-of-the-art UQ methods, they are computationally expensive or produce\nconservative prediction sets/intervals. We introduce MC-CP, a novel hybrid UQ\nmethod that combines a new adaptive Monte Carlo (MC) dropout method with\nconformal prediction (CP). MC-CP adaptively modulates the traditional MC\ndropout at runtime to save memory and computation resources, enabling\npredictions to be consumed by CP, yielding robust prediction sets/intervals.\nThroughout comprehensive experiments, we show that MC-CP delivers significant\nimprovements over advanced UQ methods, like MC dropout, RAPS and CQR, both in\nclassification and regression benchmarks. 
MC-CP can be easily added to existing\nmodels, making its deployment simple.\n","authors":["Daniel Bethell","Simos Gerasimou","Radu Calinescu"],"pdf_url":"https://arxiv.org/pdf/2308.09647v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11849v1","updated":"2024-01-22T11:08:36Z","published":"2024-01-22T11:08:36Z","title":"Self-Labeling the Job Shop Scheduling Problem","summary":" In this work, we propose a Self-Supervised training strategy specifically\ndesigned for combinatorial problems. One of the main obstacles in applying\nsupervised paradigms to such problems is the requirement of expensive target\nsolutions as ground-truth, often produced with costly exact solvers. Inspired\nby Semi- and Self-Supervised learning, we show that it is possible to easily\ntrain generative models by sampling multiple solutions and using the best one\naccording to the problem objective as a pseudo-label. In this way, we\niteratively improve the model generation capability by relying only on its\nself-supervision, completely removing the need for optimality information. We\nprove the effectiveness of this Self-Labeling strategy on the Job Shop\nScheduling (JSP), a complex combinatorial problem that is receiving much\nattention from the Reinforcement Learning community. We propose a generative\nmodel based on the well-known Pointer Network and train it with our strategy.\nExperiments on two popular benchmarks demonstrate the potential of this\napproach as the resulting models outperform constructive heuristics and current\nstate-of-the-art Reinforcement Learning proposals.\n","authors":["Andrea Corsini","Angelo Porrello","Simone Calderara","Mauro Dell'Amico"],"pdf_url":"https://arxiv.org/pdf/2401.11849v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11844v1","updated":"2024-01-22T11:01:52Z","published":"2024-01-22T11:01:52Z","title":"Adaptive Fusion of Multi-view Remote Sensing data for Optimal Sub-field\n Crop Yield Prediction","summary":" Accurate crop yield prediction is of utmost importance for informed\ndecision-making in agriculture, aiding farmers, and industry stakeholders.\nHowever, this task is complex and depends on multiple factors, such as\nenvironmental conditions, soil properties, and management practices. Combining\nheterogeneous data views poses a fusion challenge, like identifying the\nview-specific contribution to the predictive task. We present a novel\nmulti-view learning approach to predict crop yield for different crops\n(soybean, wheat, rapeseed) and regions (Argentina, Uruguay, and Germany). Our\nmulti-view input data includes multi-spectral optical images from Sentinel-2\nsatellites and weather data as dynamic features during the crop growing season,\ncomplemented by static features like soil properties and topographic\ninformation. To effectively fuse the data, we introduce a Multi-view Gated\nFusion (MVGF) model, comprising dedicated view-encoders and a Gated Unit (GU)\nmodule. The view-encoders handle the heterogeneity of data sources with varying\ntemporal resolutions by learning a view-specific representation. These\nrepresentations are adaptively fused via a weighted sum. The fusion weights are\ncomputed for each sample by the GU using a concatenation of the\nview-representations. The MVGF model is trained at sub-field level with 10 m\nresolution pixels. Our evaluations show that the MVGF outperforms conventional\nmodels on the same task, achieving the best results by incorporating all the\ndata sources, unlike the usual fusion results in the literature. 
For Argentina,\nthe MVGF model achieves an R2 value of 0.68 at sub-field yield prediction,\nwhile at field level evaluation (comparing field averages), it reaches around\n0.80 across different countries. The GU module learned different weights based\non the country and crop-type, aligning with the variable significance of each\ndata source to the prediction task.\n","authors":["Francisco Mena","Deepak Pathak","Hiba Najjar","Cristhian Sanchez","Patrick Helber","Benjamin Bischke","Peter Habelitz","Miro Miranda","Jayanth Siddamsetty","Marlon Nuske","Marcela Charfuelan","Diego Arenas","Michaela Vollmer","Andreas Dengel"],"pdf_url":"https://arxiv.org/pdf/2401.11844v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11840v1","updated":"2024-01-22T10:57:11Z","published":"2024-01-22T10:57:11Z","title":"Learning to Approximate Adaptive Kernel Convolution on Graphs","summary":" Various Graph Neural Networks (GNNs) have been successful in analyzing data\nin non-Euclidean spaces, however, they have limitations such as oversmoothing,\ni.e., information becomes excessively averaged as the number of hidden layers\nincreases. The issue stems from the intrinsic formulation of conventional graph\nconvolution where the nodal features are aggregated from a direct neighborhood\nper layer across the entire nodes in the graph. As setting different number of\nhidden layers per node is infeasible, recent works leverage a diffusion kernel\nto redefine the graph structure and incorporate information from farther nodes.\nUnfortunately, such approaches suffer from heavy diagonalization of a graph\nLaplacian or learning a large transform matrix. In this regards, we propose a\ndiffusion learning framework, where the range of feature aggregation is\ncontrolled by the scale of a diffusion kernel. For efficient computation, we\nderive closed-form derivatives of approximations of the graph convolution with\nrespect to the scale, so that node-wise range can be adaptively learned. With a\ndownstream classifier, the entire framework is made trainable in an end-to-end\nmanner. Our model is tested on various standard datasets for node-wise\nclassification for the state-of-the-art performance, and it is also validated\non a real-world brain network data for graph classifications to demonstrate its\npracticality for Alzheimer classification.\n","authors":["Jaeyoon Sim","Sooyeon Jeon","InJun Choi","Guorong Wu","Won Hwa Kim"],"pdf_url":"https://arxiv.org/pdf/2401.11840v1.pdf","comment":"15 pages, Accepted to AAAI 2024"},{"id":"http://arxiv.org/abs/2401.11836v1","updated":"2024-01-22T10:52:22Z","published":"2024-01-22T10:52:22Z","title":"Privacy-Preserving Data Fusion for Traffic State Estimation: A Vertical\n Federated Learning Approach","summary":" This paper proposes a privacy-preserving data fusion method for traffic state\nestimation (TSE). Unlike existing works that assume all data sources to be\naccessible by a single trusted party, we explicitly address data privacy\nconcerns that arise in the collaboration and data sharing between multiple data\nowners, such as municipal authorities (MAs) and mobility providers (MPs). To\nthis end, we propose a novel vertical federated learning (FL) approach, FedTSE,\nthat enables multiple data owners to collaboratively train and apply a TSE\nmodel without having to exchange their private data. 
To enhance the\napplicability of the proposed FedTSE in common TSE scenarios with limited\navailability of ground-truth data, we further propose a privacy-preserving\nphysics-informed FL approach, i.e., FedTSE-PI, that integrates traffic models\ninto FL. Real-world data validation shows that the proposed methods can protect\nprivacy while yielding similar accuracy to the oracle method without privacy\nconsiderations.\n","authors":["Qiqing Wang","Kaidi Yang"],"pdf_url":"https://arxiv.org/pdf/2401.11836v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18394v5","updated":"2024-01-22T10:44:50Z","published":"2023-05-28T12:34:07Z","title":"On Optimal Regularization Parameters via Bilevel Learning","summary":" Variational regularization is commonly used to solve linear inverse problems,\nand involves augmenting a data fidelity by a regularizer. The regularizer is\nused to promote a priori information and is weighted by a regularization\nparameter. Selection of an appropriate regularization parameter is critical,\nwith various choices leading to very different reconstructions. Classical\nstrategies used to determine a suitable parameter value include the discrepancy\nprinciple and the L-curve criterion, and in recent years a supervised machine\nlearning approach called bilevel learning has been employed. Bilevel learning\nis a powerful framework to determine optimal parameters and involves solving a\nnested optimization problem. While previous strategies enjoy various\ntheoretical results, the well-posedness of bilevel learning in this setting is\nstill an open question. In particular, a necessary property is positivity of\nthe determined regularization parameter. In this work, we provide a new\ncondition that better characterizes positivity of optimal regularization\nparameters than the existing theory. Numerical results verify and explore this\nnew condition for both small and high-dimensional problems.\n","authors":["Matthias J. Ehrhardt","Silvia Gazzola","Sebastian J. Scott"],"pdf_url":"https://arxiv.org/pdf/2305.18394v5.pdf","comment":"34 pages, 11 figures. Version for publication"},{"id":"http://arxiv.org/abs/2401.11825v1","updated":"2024-01-22T10:38:14Z","published":"2024-01-22T10:38:14Z","title":"Sparse discovery of differential equations based on multi-fidelity\n Gaussian process","summary":" Sparse identification of differential equations aims to compute the analytic\nexpressions from the observed data explicitly. However, there exist two primary\nchallenges. Firstly, it exhibits sensitivity to the noise in the observed data,\nparticularly for the derivatives computations. Secondly, existing literature\npredominantly concentrates on single-fidelity (SF) data, which imposes\nlimitations on its applicability due to the computational cost. In this paper,\nwe present two novel approaches to address these problems from the view of\nuncertainty quantification. We construct a surrogate model employing the\nGaussian process regression (GPR) to mitigate the effect of noise in the\nobserved data, quantify its uncertainty, and ultimately recover the equations\naccurately. Subsequently, we exploit the multi-fidelity Gaussian processes\n(MFGP) to address scenarios involving multi-fidelity (MF), sparse, and noisy\nobserved data. 
We demonstrate the robustness and effectiveness of our\nmethodologies through several numerical experiments.\n","authors":["Yuhuang Meng","Yue Qiu"],"pdf_url":"https://arxiv.org/pdf/2401.11825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07178v2","updated":"2024-01-22T10:31:56Z","published":"2023-12-12T11:22:31Z","title":"Beyond Expected Return: Accounting for Policy Reproducibility when\n Evaluating Reinforcement Learning Algorithms","summary":" Many applications in Reinforcement Learning (RL) usually have noise or\nstochasticity present in the environment. Beyond their impact on learning,\nthese uncertainties lead the exact same policy to perform differently, i.e.\nyield different return, from one roll-out to another. Common evaluation\nprocedures in RL summarise the consequent return distributions using solely the\nexpected return, which does not account for the spread of the distribution. Our\nwork defines this spread as the policy reproducibility: the ability of a policy\nto obtain similar performance when rolled out many times, a crucial property in\nsome real-world applications. We highlight that existing procedures that only\nuse the expected return are limited on two fronts: first an infinite number of\nreturn distributions with a wide range of performance-reproducibility\ntrade-offs can have the same expected return, limiting its effectiveness when\nused for comparing policies; second, the expected return metric does not leave\nany room for practitioners to choose the best trade-off value for considered\napplications. In this work, we address these limitations by recommending the\nuse of Lower Confidence Bound, a metric taken from Bayesian optimisation that\nprovides the user with a preference parameter to choose a desired\nperformance-reproducibility trade-off. We also formalise and quantify policy\nreproducibility, and demonstrate the benefit of our metrics using extensive\nexperiments of popular RL algorithms on common uncertain RL tasks.\n","authors":["Manon Flageat","Bryan Lim","Antoine Cully"],"pdf_url":"https://arxiv.org/pdf/2312.07178v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11817v1","updated":"2024-01-22T10:26:14Z","published":"2024-01-22T10:26:14Z","title":"Hallucination is Inevitable: An Innate Limitation of Large Language\n Models","summary":" Hallucination has been widely recognized to be a significant drawback for\nlarge language models (LLMs). There have been many works that attempt to reduce\nthe extent of hallucination. These efforts have mostly been empirical so far,\nwhich cannot answer the fundamental question whether it can be completely\neliminated. In this paper, we formalize the problem and show that it is\nimpossible to eliminate hallucination in LLMs. Specifically, we define a formal\nworld where hallucination is defined as inconsistencies between a computable\nLLM and a computable ground truth function. By employing results from learning\ntheory, we show that LLMs cannot learn all of the computable functions and will\ntherefore always hallucinate. Since the formal world is a part of the real\nworld which is much more complicated, hallucinations are also inevitable for\nreal world LLMs. Furthermore, for real world LLMs constrained by provable time\ncomplexity, we describe the hallucination-prone tasks and empirically validate\nour claims. 
Finally, using the formal world framework, we discuss the possible\nmechanisms and efficacies of existing hallucination mitigators as well as the\npractical implications on the safe deployment of LLMs.\n","authors":["Ziwei Xu","Sanjay Jain","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2401.11817v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11810v1","updated":"2024-01-22T10:14:45Z","published":"2024-01-22T10:14:45Z","title":"Generalization and Informativeness of Conformal Prediction","summary":" The safe integration of machine learning modules in decision-making processes\nhinges on their ability to quantify uncertainty. A popular technique to achieve\nthis goal is conformal prediction (CP), which transforms an arbitrary base\npredictor into a set predictor with coverage guarantees. While CP certifies the\npredicted set to contain the target quantity with a user-defined tolerance, it\ndoes not provide control over the average size of the predicted sets, i.e.,\nover the informativeness of the prediction. In this work, a theoretical\nconnection is established between the generalization properties of the base\npredictor and the informativeness of the resulting CP prediction sets. To this\nend, an upper bound is derived on the expected size of the CP set predictor\nthat builds on generalization error bounds for the base predictor. The derived\nupper bound provides insights into the dependence of the average size of the CP\nset predictor on the amount of calibration data, the target reliability, and\nthe generalization performance of the base predictor. The theoretical insights\nare validated using simple numerical regression and classification tasks.\n","authors":["Matteo Zecchin","Sangwoo Park","Osvaldo Simeone","Fredrik Hellström"],"pdf_url":"https://arxiv.org/pdf/2401.11810v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.01168v3","updated":"2024-01-22T10:09:20Z","published":"2022-12-02T13:47:21Z","title":"Towards Cross Domain Generalization of Hamiltonian Representation via\n Meta Learning","summary":" Recent advances in deep learning for physics have focused on discovering\nshared representations of target systems by incorporating physics priors or\ninductive biases into neural networks. While effective, these methods are\nlimited to the system domain, where the type of system remains consistent and\nthus cannot ensure the adaptation to new, or unseen physical systems governed\nby different laws. For instance, a neural network trained on a mass-spring\nsystem cannot guarantee accurate predictions for the behavior of a two-body\nsystem or any other system with different physical laws. In this work, we take\na significant leap forward by targeting cross domain generalization within the\nfield of Hamiltonian dynamics. We model our system with a graph neural network\nand employ a meta learning algorithm to enable the model to gain experience\nover a distribution of tasks and make it adapt to new physics. Our approach\naims to learn a unified Hamiltonian representation that is generalizable across\nmultiple system domains, thereby overcoming the limitations of system-specific\nmodels. Our results demonstrate that the meta-trained model not only adapts\neffectively to new systems but also captures a generalized Hamiltonian\nrepresentation that is consistent across different physical domains. 
Overall,\nthrough the use of meta learning, we offer a framework that achieves cross\ndomain generalization, providing a step towards a unified model for\nunderstanding a wide array of dynamical systems via deep learning.\n","authors":["Yeongwoo Song","Hawoong Jeong"],"pdf_url":"https://arxiv.org/pdf/2212.01168v3.pdf","comment":"Conference paper at ICLR 2024"},{"id":"http://arxiv.org/abs/2311.06558v2","updated":"2024-01-22T10:07:39Z","published":"2023-11-11T12:28:31Z","title":"Convolve and Conquer: Data Comparison with Wiener Filters","summary":" Quantitative evaluations of differences and/or similarities between data\nsamples define and shape optimisation problems associated with learning data\ndistributions. Current methods to compare data often suffer from limitations in\ncapturing such distributions or lack desirable mathematical properties for\noptimisation (e.g. smoothness, differentiability, or convexity). In this paper,\nwe introduce a new method to measure (dis)similarities between paired samples\ninspired by Wiener-filter theory. The convolutional nature of Wiener filters\nallows us to comprehensively compare data samples in a globally correlated way.\nWe validate our approach in four machine learning applications: data\ncompression, medical imaging imputation, translated classification, and\nnon-parametric generative modelling. Our results demonstrate increased\nresolution in reconstructed images with better perceptual quality and higher\ndata fidelity, as well as robustness against translations, compared to\nconventional mean-squared-error analogue implementations.\n","authors":["Deborah Pelacani Cruz","George Strong","Oscar Bates","Carlos Cueto","Jiashun Yao","Lluis Guasch"],"pdf_url":"https://arxiv.org/pdf/2311.06558v2.pdf","comment":"10 pages, 5 figures, Medical Imaging Meets Neurips Workshop"},{"id":"http://arxiv.org/abs/2401.11798v1","updated":"2024-01-22T09:54:49Z","published":"2024-01-22T09:54:49Z","title":"Knowledge Distillation on Spatial-Temporal Graph Convolutional Network\n for Traffic Prediction","summary":" Efficient real-time traffic prediction is crucial for reducing transportation\ntime. To predict traffic conditions, we employ a spatio-temporal graph neural\nnetwork (ST-GNN) to model our real-time traffic data as temporal graphs.\nDespite its capabilities, it often encounters challenges in delivering\nefficient real-time predictions for real-world traffic data. Recognizing the\nsignificance of timely prediction due to the dynamic nature of real-time data,\nwe employ knowledge distillation (KD) as a solution to enhance the execution\ntime of ST-GNNs for traffic prediction. In this paper, We introduce a cost\nfunction designed to train a network with fewer parameters (the student) using\ndistilled data from a complex network (the teacher) while maintaining its\naccuracy close to that of the teacher. We use knowledge distillation,\nincorporating spatial-temporal correlations from the teacher network to enable\nthe student to learn the complex patterns perceived by the teacher. However, a\nchallenge arises in determining the student network architecture rather than\nconsidering it inadvertently. To address this challenge, we propose an\nalgorithm that utilizes the cost function to calculate pruning scores,\naddressing small network architecture search issues, and jointly fine-tunes the\nnetwork resulting from each pruning stage using KD. Ultimately, we evaluate our\nproposed ideas on two real-world datasets, PeMSD7 and PeMSD8. 
The results\nindicate that our method can maintain the student's accuracy close to that of\nthe teacher, even with the retention of only $3\\%$ of network parameters.\n","authors":["Mohammad Izadi","Mehran Safayani","Abdolreza Mirzaei"],"pdf_url":"https://arxiv.org/pdf/2401.11798v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12817v2","updated":"2024-01-22T09:44:18Z","published":"2023-10-19T15:12:44Z","title":"2D-3D Interlaced Transformer for Point Cloud Segmentation with\n Scene-Level Supervision","summary":" We present a Multimodal Interlaced Transformer (MIT) that jointly considers\n2D and 3D data for weakly supervised point cloud segmentation. Research studies\nhave shown that 2D and 3D features are complementary for point cloud\nsegmentation. However, existing methods require extra 2D annotations to achieve\n2D-3D information fusion. Considering the high annotation cost of point clouds,\neffective 2D and 3D feature fusion based on weakly supervised learning is in\ngreat demand. To this end, we propose a transformer model with two encoders and\none decoder for weakly supervised point cloud segmentation using only\nscene-level class tags. Specifically, the two encoders compute the\nself-attended features for 3D point clouds and 2D multi-view images,\nrespectively. The decoder implements interlaced 2D-3D cross-attention and\ncarries out implicit 2D and 3D feature fusion. We alternately switch the roles\nof queries and key-value pairs in the decoder layers. It turns out that the 2D\nand 3D features are iteratively enriched by each other. Experiments show that\nit performs favorably against existing weakly supervised point cloud\nsegmentation methods by a large margin on the S3DIS and ScanNet benchmarks. The\nproject page will be available at https://jimmy15923.github.io/mit_web/.\n","authors":["Cheng-Kun Yang","Min-Hung Chen","Yung-Yu Chuang","Yen-Yu Lin"],"pdf_url":"https://arxiv.org/pdf/2310.12817v2.pdf","comment":"ICCV 2023 (main + supp). Website:\n https://jimmy15923.github.io/mit_web/"},{"id":"http://arxiv.org/abs/2401.11792v1","updated":"2024-01-22T09:44:16Z","published":"2024-01-22T09:44:16Z","title":"Safe and Generalized end-to-end Autonomous Driving System with\n Reinforcement Learning and Demonstrations","summary":" An intelligent driving system should be capable of dynamically formulating\nappropriate driving strategies based on the current environment and vehicle\nstatus, while ensuring the security and reliability of the system. However,\nexisting methods based on reinforcement learning and imitation learning suffer\nfrom low safety, poor generalization, and inefficient sampling. Additionally,\nthey cannot accurately predict future driving trajectories, and the accurate\nprediction of future driving trajectories is a precondition for making optimal\ndecisions. To solve these problems, in this paper, we introduce a Safe and\nGeneralized end-to-end Autonomous Driving System (SGADS) for complex and\nvarious scenarios. Our SGADS incorporates variational inference with\nnormalizing flows, enabling the intelligent vehicle to accurately predict\nfuture driving trajectories. Moreover, we propose the formulation of robust\nsafety constraints. Furthermore, we combine reinforcement learning with\ndemonstrations to augment search process of the agent. 
The experimental results\ndemonstrate that our SGADS can significantly improve safety performance,\nexhibit strong generalization, and enhance the training efficiency of\nintelligent vehicles in complex urban scenarios compared to existing methods.\n","authors":["Zuojin Tang","Xiaoyu Chen","YongQiang Li","Jianyu Chen"],"pdf_url":"https://arxiv.org/pdf/2401.11792v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11791v1","updated":"2024-01-22T09:41:05Z","published":"2024-01-22T09:41:05Z","title":"SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic\n Segmentation","summary":" Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation\nmodels using training image data with only image-level supervision. Since\nprecise pixel-level annotations are not accessible, existing methods typically\nfocus on producing pseudo masks for training segmentation models by refining\nCAM-like heatmaps. However, the produced heatmaps may only capture\ndiscriminative image regions of target object categories or the associated\nco-occurring backgrounds. To address the issues, we propose a Semantic Prompt\nLearning for WSSS (SemPLeS) framework, which learns to effectively prompt the\nCLIP space to enhance the semantic alignment between the segmented regions and\nthe target object categories. More specifically, we propose Contrastive Prompt\nLearning and Class-associated Semantic Refinement to learn the prompts that\nadequately describe and suppress the image backgrounds associated with each\ntarget object category. In this way, our proposed framework is able to perform\nbetter semantic matching between object regions and the associated text labels,\nresulting in desired pseudo masks for training the segmentation model. The\nproposed SemPLeS framework achieves SOTA performance on the standard WSSS\nbenchmarks, PASCAL VOC and MS COCO, and demonstrated interpretability with the\nsemantic visualization of our learned prompts. The codes will be released.\n","authors":["Ci-Siang Lin","Chien-Yi Wang","Yu-Chiang Frank Wang","Min-Hung Chen"],"pdf_url":"https://arxiv.org/pdf/2401.11791v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.06101v2","updated":"2024-01-22T09:27:30Z","published":"2023-11-10T15:09:04Z","title":"In-Context Learning for MIMO Equalization Using Transformer-Based\n Sequence Models","summary":" Large pre-trained sequence models, such as transformer-based architectures,\nhave been recently shown to have the capacity to carry out in-context learning\n(ICL). In ICL, a decision on a new input is made via a direct mapping of the\ninput and of a few examples from the given task, serving as the task's context,\nto the output variable. No explicit updates of the model parameters are needed\nto tailor the decision to a new task. Pre-training, which amounts to a form of\nmeta-learning, is based on the observation of examples from several related\ntasks. Prior work has shown ICL capabilities for linear regression. In this\nstudy, we leverage ICL to address the inverse problem of multiple-input and\nmultiple-output (MIMO) equalization based on a context given by pilot symbols.\nA task is defined by the unknown fading channel and by the signal-to-noise\nratio (SNR) level, which may be known. To highlight the practical potential of\nthe approach, we allow the presence of quantization of the received signals. 
We\ndemonstrate via numerical results that transformer-based ICL has a threshold\nbehavior, whereby, as the number of pre-training tasks grows, the performance\nswitches from that of a minimum mean squared error (MMSE) equalizer with a\nprior determined by the pre-trained tasks to that of an MMSE equalizer with the\ntrue data-generating prior.\n","authors":["Matteo Zecchin","Kai Yu","Osvaldo Simeone"],"pdf_url":"https://arxiv.org/pdf/2311.06101v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11772v1","updated":"2024-01-22T09:09:10Z","published":"2024-01-22T09:09:10Z","title":"LightDiC: A Simple yet Effective Approach for Large-scale Digraph\n Representation Learning","summary":" Most existing graph neural networks (GNNs) are limited to undirected graphs,\nwhose restricted scope of the captured relational information hinders their\nexpressive capabilities and deployments in real-world scenarios. Compared with\nundirected graphs, directed graphs (digraphs) fit the demand for modeling more\ncomplex topological systems by capturing more intricate relationships between\nnodes, such as formulating transportation and financial networks. While some\ndirected GNNs have been introduced, their inspiration mainly comes from deep\nlearning architectures, which lead to redundant complexity and computation,\nmaking them inapplicable to large-scale databases. To address these issues, we\npropose LightDiC, a scalable variant of the digraph convolution based on the\nmagnetic Laplacian. Since topology-related computations are conducted solely\nduring offline pre-processing, LightDiC achieves exceptional scalability,\nenabling downstream predictions to be trained separately without incurring\nrecursive computational costs. Theoretical analysis shows that LightDiC\nutilizes directed information to achieve message passing based on the complex\nfield, which corresponds to the proximal gradient descent process of the\nDirichlet energy optimization function from the perspective of digraph signal\ndenoising, ensuring its expressiveness. Experimental results demonstrate that\nLightDiC performs comparably well or even outperforms other SOTA methods in\nvarious downstream tasks, with fewer learnable parameters and higher training\nefficiency. Notably, LightDiC is the first DiGNN to provide satisfactory\nresults in the most representative large-scale database (ogbn-papers100M).\n","authors":["Xunkai Li","Meihao Liao","Zhengyu Wu","Daohan Su","Wentao Zhang","Rong-Hua Li","Guoren Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11772v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2401.11768v1","updated":"2024-01-22T09:03:16Z","published":"2024-01-22T09:03:16Z","title":"ADA-GNN: Atom-Distance-Angle Graph Neural Network for Crystal Material\n Property Prediction","summary":" Property prediction is a fundamental task in crystal material research. To\nmodel atoms and structures, structures represented as graphs are widely used\nand graph learning-based methods have achieved significant progress. Bond\nangles and bond distances are two key structural information that greatly\ninfluence crystal properties. However, most of the existing works only consider\nbond distances and overlook bond angles. The main challenge lies in the time\ncost of handling bond angles, which leads to a significant increase in\ninference time. 
To solve this issue, we first propose a crystal structure\nmodeling based on dual scale neighbor partitioning mechanism, which uses a\nlarger scale cutoff for edge neighbors and a smaller scale cutoff for angle\nneighbors. Then, we propose a novel Atom-Distance-Angle Graph Neural Network\n(ADA-GNN) for property prediction tasks, which can process node information and\nstructural information separately. The accuracy of predictions and inference\ntime are improved with the dual scale modeling and the specially designed\narchitecture of ADA-GNN. The experimental results validate that our approach\nachieves state-of-the-art results in two large-scale material benchmark\ndatasets on property prediction tasks.\n","authors":["Jiao Huang","Qianli Xing","Jinglong Ji","Bo Yang"],"pdf_url":"https://arxiv.org/pdf/2401.11768v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.05023v2","updated":"2024-01-22T08:47:49Z","published":"2023-06-08T08:22:27Z","title":"Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in\n Conditional and Hierarchical Variational Autoencoders","summary":" The posterior collapse phenomenon in variational autoencoder (VAE), where the\nvariational posterior distribution closely matches the prior distribution, can\nhinder the quality of the learned latent variables. As a consequence of\nposterior collapse, the latent variables extracted by the encoder in VAE\npreserve less information from the input data and thus fail to produce\nmeaningful representations as input to the reconstruction process in the\ndecoder. While this phenomenon has been an actively addressed topic related to\nVAE performance, the theory for posterior collapse remains underdeveloped,\nespecially beyond the standard VAE. In this work, we advance the theoretical\nunderstanding of posterior collapse to two important and prevalent yet less\nstudied classes of VAE: conditional VAE and hierarchical VAE. Specifically, via\na non-trivial theoretical analysis of linear conditional VAE and hierarchical\nVAE with two levels of latent, we prove that the cause of posterior collapses\nin these models includes the correlation between the input and output of the\nconditional VAE and the effect of learnable encoder variance in the\nhierarchical VAE. We empirically validate our theoretical findings for linear\nconditional and hierarchical VAE and demonstrate that these results are also\npredictive for non-linear cases with extensive experiments.\n","authors":["Hien Dang","Tho Tran","Tan Nguyen","Nhat Ho"],"pdf_url":"https://arxiv.org/pdf/2306.05023v2.pdf","comment":"International Conference on Learning Representations (ICLR) 2024"},{"id":"http://arxiv.org/abs/2401.11760v1","updated":"2024-01-22T08:45:29Z","published":"2024-01-22T08:45:29Z","title":"Towards Effective and General Graph Unlearning via Mutual Evolution","summary":" With the rapid advancement of AI applications, the growing needs for data\nprivacy and model robustness have highlighted the importance of machine\nunlearning, especially in thriving graph-based scenarios. However, most\nexisting graph unlearning strategies primarily rely on well-designed\narchitectures or manual process, rendering them less user-friendly and posing\nchallenges in terms of deployment efficiency. Furthermore, striking a balance\nbetween unlearning performance and framework generalization is also a pivotal\nconcern. 
To address the above issues, we propose \\underline{\\textbf{M}}utual\n\\underline{\\textbf{E}}volution \\underline{\\textbf{G}}raph\n\\underline{\\textbf{U}}nlearning (MEGU), a new mutual evolution paradigm that\nsimultaneously evolves the predictive and unlearning capacities of graph\nunlearning. By incorporating aforementioned two components, MEGU ensures\ncomplementary optimization in a unified training framework that aligns with the\nprediction and unlearning requirements. Extensive experiments on 9 graph\nbenchmark datasets demonstrate the superior performance of MEGU in addressing\nunlearning requirements at the feature, node, and edge levels. Specifically,\nMEGU achieves average performance improvements of 2.7\\%, 2.5\\%, and 3.2\\%\nacross these three levels of unlearning tasks when compared to state-of-the-art\nbaselines. Furthermore, MEGU exhibits satisfactory training efficiency,\nreducing time and space overhead by an average of 159.8x and 9.6x,\nrespectively, in comparison to retraining GNN from scratch.\n","authors":["Xunkai Li","Yulin Zhao","Zhengyu Wu","Wentao Zhang","Rong-Hua Li","Guoren Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11760v1.pdf","comment":"Accepted by AAAI 2024 Oral"},{"id":"http://arxiv.org/abs/2401.09953v2","updated":"2024-01-22T08:32:02Z","published":"2024-01-18T12:58:53Z","title":"Through the Dual-Prism: A Spectral Perspective on Graph Data\n Augmentation for Graph Classification","summary":" Graph Neural Networks (GNNs) have become the preferred tool to process graph\ndata, with their efficacy being boosted through graph data augmentation\ntechniques. Despite the evolution of augmentation methods, issues like graph\nproperty distortions and restricted structural changes persist. This leads to\nthe question: Is it possible to develop more property-conserving and\nstructure-sensitive augmentation methods? Through a spectral lens, we\ninvestigate the interplay between graph properties, their augmentation, and\ntheir spectral behavior, and found that keeping the low-frequency eigenvalues\nunchanged can preserve the critical properties at a large scale when generating\naugmented graphs. These observations inform our introduction of the Dual-Prism\n(DP) augmentation method, comprising DP-Noise and DP-Mask, which adeptly\nretains essential graph properties while diversifying augmented graphs.\nExtensive experiments validate the efficiency of our approach, providing a new\nand promising direction for graph data augmentation.\n","authors":["Yutong Xia","Runpeng Yu","Yuxuan Liang","Xavier Bresson","Xinchao Wang","Roger Zimmermann"],"pdf_url":"https://arxiv.org/pdf/2401.09953v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11755v1","updated":"2024-01-22T08:31:53Z","published":"2024-01-22T08:31:53Z","title":"FedGTA: Topology-aware Averaging for Federated Graph Learning","summary":" Federated Graph Learning (FGL) is a distributed machine learning paradigm\nthat enables collaborative training on large-scale subgraphs across multiple\nlocal systems. Existing FGL studies fall into two categories: (i) FGL\nOptimization, which improves multi-client training in existing machine learning\nmodels; (ii) FGL Model, which enhances performance with complex local models\nand multi-client interactions. However, most FGL optimization strategies are\ndesigned specifically for the computer vision domain and ignore graph\nstructure, presenting dissatisfied performance and slow convergence. 
Meanwhile,\ncomplex local model architectures in FGL Models studies lack scalability for\nhandling large-scale subgraphs and have deployment limitations. To address\nthese issues, we propose Federated Graph Topology-aware Aggregation (FedGTA), a\npersonalized optimization strategy that optimizes through topology-aware local\nsmoothing confidence and mixed neighbor features. During experiments, we deploy\nFedGTA in 12 multi-scale real-world datasets with the Louvain and Metis split.\nThis allows us to evaluate the performance and robustness of FedGTA across a\nrange of scenarios. Extensive experiments demonstrate that FedGTA achieves\nstate-of-the-art performance while exhibiting high scalability and efficiency.\nThe experiment includes ogbn-papers100M, the most representative large-scale\ngraph database so that we can verify the applicability of our method to\nlarge-scale graph learning. To the best of our knowledge, our study is the\nfirst to bridge large-scale graph learning with FGL using this optimization\nstrategy, contributing to the development of efficient and scalable FGL\nmethods.\n","authors":["Xunkai Li","Zhengyu Wu","Wentao Zhang","Yinlin Zhu","Rong-Hua Li","Guoren Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11755v1.pdf","comment":"Accepted by VLDB 2024"},{"id":"http://arxiv.org/abs/2401.11750v1","updated":"2024-01-22T08:23:31Z","published":"2024-01-22T08:23:31Z","title":"AdaFGL: A New Paradigm for Federated Node Classification with Topology\n Heterogeneity","summary":" Recently, Federated Graph Learning (FGL) has attracted significant attention\nas a distributed framework based on graph neural networks, primarily due to its\ncapability to break data silos. Existing FGL studies employ community split on\nthe homophilous global graph by default to simulate federated semi-supervised\nnode classification settings. Such a strategy assumes the consistency of\ntopology between the multi-client subgraphs and the global graph, where\nconnected nodes are highly likely to possess similar feature distributions and\nthe same label. However, in real-world implementations, the varying\nperspectives of local data engineering result in various subgraph topologies,\nposing unique heterogeneity challenges in FGL. Unlike the well-known label\nNon-independent identical distribution (Non-iid) problems in federated\nlearning, FGL heterogeneity essentially reveals the topological divergence\namong multiple clients, namely homophily or heterophily. To simulate and handle\nthis unique challenge, we introduce the concept of structure Non-iid split and\nthen present a new paradigm called \\underline{Ada}ptive \\underline{F}ederated\n\\underline{G}raph \\underline{L}earning (AdaFGL), a decoupled two-step\npersonalized approach. To begin with, AdaFGL employs standard multi-client\nfederated collaborative training to acquire the federated knowledge extractor\nby aggregating uploaded models in the final round at the server. Then, each\nclient conducts personalized training based on the local subgraph and the\nfederated knowledge extractor. Extensive experiments on the 12 graph benchmark\ndatasets validate the superior performance of AdaFGL over state-of-the-art\nbaselines. 
Specifically, in terms of test accuracy, our proposed AdaFGL\noutperforms baselines by significant margins of 3.24\\% and 5.57\\% on community\nsplit and structure Non-iid split, respectively.\n","authors":["Xunkai Li","Zhengyu Wu","Wentao Zhang","Henan Sun","Rong-Hua Li","Guoren Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11750v1.pdf","comment":"Accepted by ICDE 2024"},{"id":"http://arxiv.org/abs/2401.11748v1","updated":"2024-01-22T08:20:47Z","published":"2024-01-22T08:20:47Z","title":"GI-PIP: Do We Require Impractical Auxiliary Dataset for Gradient\n Inversion Attacks?","summary":" Deep gradient inversion attacks expose a serious threat to Federated Learning\n(FL) by accurately recovering private data from shared gradients. However, the\nstate-of-the-art heavily relies on impractical assumptions to access excessive\nauxiliary data, which violates the basic data partitioning principle of FL. In\nthis paper, a novel method, Gradient Inversion Attack using Practical Image\nPrior (GI-PIP), is proposed under a revised threat model. GI-PIP exploits\nanomaly detection models to capture the underlying distribution from fewer\ndata, while GAN-based methods consume significant more data to synthesize\nimages. The extracted distribution is then leveraged to regulate the attack\nprocess as Anomaly Score loss. Experimental results show that GI-PIP achieves a\n16.12 dB PSNR recovery using only 3.8\\% data of ImageNet, while GAN-based\nmethods necessitate over 70\\%. Moreover, GI-PIP exhibits superior capability on\ndistribution generalization compared to GAN-based methods. Our approach\nsignificantly alleviates the auxiliary data requirement on both amount and\ndistribution in gradient inversion attacks, hence posing more substantial\nthreat to real-world FL.\n","authors":["Yu sun","Gaojian Xiong","Xianxun Yao","Kailang Ma","Jian Cui"],"pdf_url":"https://arxiv.org/pdf/2401.11748v1.pdf","comment":"5pages, 5 figures, accepted to ICASSP 2024, not published yet"},{"id":"http://arxiv.org/abs/2401.10765v2","updated":"2024-01-22T08:17:42Z","published":"2024-01-19T15:37:11Z","title":"Starlit: Privacy-Preserving Federated Learning to Enhance Financial\n Fraud Detection","summary":" Federated Learning (FL) is a data-minimization approach enabling\ncollaborative model training across diverse clients with local data, avoiding\ndirect data exchange. However, state-of-the-art FL solutions to identify\nfraudulent financial transactions exhibit a subset of the following\nlimitations. They (1) lack a formal security definition and proof, (2) assume\nprior freezing of suspicious customers' accounts by financial institutions\n(limiting the solutions' adoption), (3) scale poorly, involving either $O(n^2)$\ncomputationally expensive modular exponentiation (where $n$ is the total number\nof financial institutions) or highly inefficient fully homomorphic encryption,\n(4) assume the parties have already completed the identity alignment phase,\nhence excluding it from the implementation, performance evaluation, and\nsecurity analysis, and (5) struggle to resist clients' dropouts. This work\nintroduces Starlit, a novel scalable privacy-preserving FL mechanism that\novercomes these limitations. It has various applications, such as enhancing\nfinancial fraud detection, mitigating terrorism, and enhancing digital health.\nWe implemented Starlit and conducted a thorough performance analysis using\nsynthetic data from a key player in global financial transactions. 
The\nevaluation indicates Starlit's scalability, efficiency, and accuracy.\n","authors":["Aydin Abadi","Bradley Doyle","Francesco Gini","Kieron Guinamard","Sasi Kumar Murakonda","Jack Liddell","Paul Mellor","Steven J. Murdoch","Mohammad Naseri","Hector Page","George Theodorakopoulos","Suzanne Weller"],"pdf_url":"https://arxiv.org/pdf/2401.10765v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.19604v3","updated":"2024-01-22T08:13:50Z","published":"2023-05-31T07:22:15Z","title":"Medication Recommendation via Domain Knowledge Informed Deep Learning","summary":" Medication recommendation is a fundamental yet crucial branch of healthcare,\nwhich provides opportunities to support clinical physicians with more accurate\nmedication prescriptions for patients with complex health conditions. Learning\nfrom electronic health records (EHR) to recommend medications is the most\ncommon way in previous studies. However, most of them neglect incorporating\ndomain knowledge according to the clinical manifestations in the EHR of the\npatient. To address these issues, we propose a novel \\textbf{D}omain\n\\textbf{K}nowledge \\textbf{I}nformed \\textbf{Net}work (DKINet) to integrate\ndomain knowledge with observable clinical manifestations of the patient, which\nis the first dynamic domain knowledge informed framework toward medication\nrecommendation. In particular, we first design a knowledge-driven encoder to\ncapture the domain information and then develop a data-driven encoder to\nintegrate domain knowledge into the observable EHR. To endow the model with the\ncapability of temporal decision, we design an explicit medication encoder for\nlearning the longitudinal dependence of the patient. Extensive experiments on\nthree publicly available datasets verify the superiority of our method. The\ncode will be public upon acceptance.\n","authors":["Sicen Liu","Xiaolong Wang","Xianbing Zhao","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2305.19604v3.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2401.11740v1","updated":"2024-01-22T07:37:25Z","published":"2024-01-22T07:37:25Z","title":"Multi-level Cross-modal Alignment for Image Clustering","summary":" Recently, the cross-modal pretraining model has been employed to produce\nmeaningful pseudo-labels to supervise the training of an image clustering\nmodel. However, numerous erroneous alignments in a cross-modal pre-training\nmodel could produce poor-quality pseudo-labels and degrade clustering\nperformance. To solve the aforementioned issue, we propose a novel\n\\textbf{Multi-level Cross-modal Alignment} method to improve the alignments in\na cross-modal pretraining model for downstream tasks, by building a smaller but\nbetter semantic space and aligning the images and texts in three levels, i.e.,\ninstance-level, prototype-level, and semantic-level. Theoretical results show\nthat our proposed method converges, and suggests effective means to reduce the\nexpected clustering risk of our method. 
Experimental results on five benchmark\ndatasets clearly show the superiority of our new method.\n","authors":["Liping Qiu","Qin Zhang","Xiaojun Chen","Shaotian Cai"],"pdf_url":"https://arxiv.org/pdf/2401.11740v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11739v1","updated":"2024-01-22T07:34:06Z","published":"2024-01-22T07:34:06Z","title":"EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models","summary":" Diffusion models have recently received increasing research attention for\ntheir remarkable transfer abilities in semantic segmentation tasks. However,\ngenerating fine-grained segmentation masks with diffusion models often requires\nadditional training on annotated datasets, leaving it unclear to what extent\npre-trained diffusion models alone understand the semantic relations of their\ngenerated images. To address this question, we leverage the semantic knowledge\nextracted from Stable Diffusion (SD) and aim to develop an image segmentor\ncapable of generating fine-grained segmentation maps without any additional\ntraining. The primary difficulty stems from the fact that semantically\nmeaningful feature maps typically exist only in the spatially lower-dimensional\nlayers, which poses a challenge in directly extracting pixel-level semantic\nrelations from these feature maps. To overcome this issue, our framework\nidentifies semantic correspondences between image pixels and spatial locations\nof low-dimensional feature maps by exploiting SD's generation process and\nutilizes them for constructing image-resolution segmentation maps. In extensive\nexperiments, the produced segmentation maps are demonstrated to be well\ndelineated and capture detailed parts of the images, indicating the existence\nof highly accurate pixel-level semantic knowledge in diffusion models.\n","authors":["Koichi Namekata","Amirmojtaba Sabour","Sanja Fidler","Seung Wook Kim"],"pdf_url":"https://arxiv.org/pdf/2401.11739v1.pdf","comment":"ICLR 2024. Project page: https://kmcode1.github.io/Projects/EmerDiff/"},{"id":"http://arxiv.org/abs/2401.11736v1","updated":"2024-01-22T07:24:15Z","published":"2024-01-22T07:24:15Z","title":"Attention on Personalized Clinical Decision Support System: Federated\n Learning Approach","summary":" Health management has become a primary problem as new kinds of diseases and\ncomplex symptoms are introduced to a rapidly growing modern society. Building a\nbetter and smarter healthcare infrastructure is one of the ultimate goals of a\nsmart city. To the best of our knowledge, neural network models are already\nemployed to assist healthcare professionals in achieving this goal. Typically,\ntraining a neural network requires a rich amount of data but heterogeneous and\nvulnerable properties of clinical data introduce a challenge for the\ntraditional centralized network. Moreover, adding new inputs to a medical\ndatabase requires re-training an existing model from scratch. To tackle these\nchallenges, we proposed a deep learning-based clinical decision support system\ntrained and managed under a federated learning paradigm. We focused on a novel\nstrategy to guarantee the safety of patient privacy and overcome the risk of\ncyberattacks while enabling large-scale clinical data mining. As a result, we\ncan leverage rich clinical data for training each local neural network without\nthe need for exchanging the confidential data of patients. Moreover, we\nimplemented the proposed scheme as a sequence-to-sequence model architecture\nintegrating the attention mechanism. 
Thus, our objective is to provide a\npersonalized clinical decision support system with evolvable characteristics\nthat can deliver accurate solutions and assist healthcare professionals in\nmedical diagnosing.\n","authors":["Chu Myaet Thwal","Kyi Thar","Ye Lin Tun","Choong Seon Hong"],"pdf_url":"https://arxiv.org/pdf/2401.11736v1.pdf","comment":"Published in IEEE BigComp 2021"},{"id":"http://arxiv.org/abs/2401.11731v1","updated":"2024-01-22T07:19:16Z","published":"2024-01-22T07:19:16Z","title":"Fast and Scalable Network Slicing by Integrating Deep Learning with\n Lagrangian Methods","summary":" Network slicing is a key technique in 5G and beyond for efficiently\nsupporting diverse services. Many network slicing solutions rely on deep\nlearning to manage complex and high-dimensional resource allocation problems.\nHowever, deep learning models suffer limited generalization and adaptability to\ndynamic slicing configurations. In this paper, we propose a novel framework\nthat integrates constrained optimization methods and deep learning models,\nresulting in strong generalization and superior approximation capability. Based\non the proposed framework, we design a new neural-assisted algorithm to\nallocate radio resources to slices to maximize the network utility under\ninter-slice resource constraints. The algorithm exhibits high scalability,\naccommodating varying numbers of slices and slice configurations with ease. We\nimplement the proposed solution in a system-level network simulator and\nevaluate its performance extensively by comparing it to state-of-the-art\nsolutions including deep reinforcement learning approaches. The numerical\nresults show that our solution obtains near-optimal quality-of-service\nsatisfaction and promising generalization performance under different network\nslicing scenarios.\n","authors":["Tianlun Hu","Qi Liao","Qiang Liu","Antonio Massaro","Georg Carle"],"pdf_url":"https://arxiv.org/pdf/2401.11731v1.pdf","comment":"6 pages, 5 figures, IEEE Global Communications Conference 2023"},{"id":"http://arxiv.org/abs/2305.00418v3","updated":"2024-01-22T07:09:17Z","published":"2023-04-30T07:28:06Z","title":"An Empirical Study of Using Large Language Models for Unit Test\n Generation","summary":" A code generation model generates code by taking a prompt from a code\ncomment, existing code, or a combination of both. Although code generation\nmodels (e.g., GitHub Copilot) are increasingly being adopted in practice, it is\nunclear whether they can successfully be used for unit test generation without\nfine-tuning for a strongly typed language like Java. To fill this gap, we\ninvestigated how well three models (Codex, GPT-3.5-Turbo, and StarCoder) can\ngenerate unit tests. We used two benchmarks (HumanEval and Evosuite SF110) to\ninvestigate the effect of context generation on the unit test generation\nprocess. We evaluated the models based on compilation rates, test correctness,\ntest coverage, and test smells. We found that the Codex model achieved above\n80% coverage for the HumanEval dataset, but no model had more than 2% coverage\nfor the EvoSuite SF110 benchmark. The generated tests also suffered from test\nsmells, such as Duplicated Asserts and Empty Tests.\n","authors":["Mohammed Latif Siddiq","Joanna C. S. 
Santos","Ridwanul Hasan Tanvir","Noshin Ulfat","Fahmid Al Rifat","Vinicius Carvalho Lopes"],"pdf_url":"https://arxiv.org/pdf/2305.00418v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11726v1","updated":"2024-01-22T07:07:32Z","published":"2024-01-22T07:07:32Z","title":"Detecting Out-of-Distribution Samples via Conditional Distribution\n Entropy with Optimal Transport","summary":" When deploying a trained machine learning model in the real world, it is\ninevitable to receive inputs from out-of-distribution (OOD) sources. For\ninstance, in continual learning settings, it is common to encounter OOD samples\ndue to the non-stationarity of a domain. More generally, when we have access to\na set of test inputs, the existing rich line of OOD detection solutions,\nespecially the recent promise of distance-based methods, falls short in\neffectively utilizing the distribution information from training samples and\ntest inputs. In this paper, we argue that empirical probability distributions\nthat incorporate geometric information from both training samples and test\ninputs can be highly beneficial for OOD detection in the presence of test\ninputs available. To address this, we propose to model OOD detection as a\ndiscrete optimal transport problem. Within the framework of optimal transport,\nwe propose a novel score function known as the \\emph{conditional distribution\nentropy} to quantify the uncertainty of a test input being an OOD sample. Our\nproposal inherits the merits of certain distance-based methods while\neliminating the reliance on distribution assumptions, a-prior knowledge, and\nspecific training mechanisms. Extensive experiments conducted on benchmark\ndatasets demonstrate that our method outperforms its competitors in OOD\ndetection.\n","authors":["Chuanwen Feng","Wenlong Chen","Ao Ke","Yilong Ren","Xike Xie","S. Kevin Zhou"],"pdf_url":"https://arxiv.org/pdf/2401.11726v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11720v1","updated":"2024-01-22T06:47:00Z","published":"2024-01-22T06:47:00Z","title":"Graph Condensation: A Survey","summary":" The burgeoning volume of graph data poses significant challenges in storage,\ntransmission, and particularly the training of graph neural networks (GNNs). To\naddress these challenges, graph condensation (GC) has emerged as an innovative\nsolution. GC focuses on synthesizing a compact yet highly representative graph,\non which GNNs can achieve performance comparable to trained on the large\noriginal graph. The notable efficacy of GC and its broad prospects have\ngarnered significant attention and spurred extensive research. This survey\npaper provides an up-to-date and systematic overview of GC, organizing existing\nresearch into four categories aligned with critical GC evaluation criteria:\neffectiveness, generalization, fairness, and efficiency. To facilitate an\nin-depth and comprehensive understanding of GC, we examine various methods\nunder each category and thoroughly discuss two essential components within GC:\noptimization strategies and condensed graph generation. 
Additionally, we\nintroduce the applications of GC in a variety of fields, and highlight the\npresent challenges and novel insights in GC, promoting advancements in future\nresearch.\n","authors":["Xinyi Gao","Junliang Yu","Wei Jiang","Tong Chen","Wentao Zhang","Hongzhi Yin"],"pdf_url":"https://arxiv.org/pdf/2401.11720v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11698v1","updated":"2024-01-22T05:44:43Z","published":"2024-01-22T05:44:43Z","title":"Admission Prediction in Undergraduate Applications: an Interpretable\n Deep Learning Approach","summary":" This article addresses the challenge of validating the admission committee's\ndecisions for undergraduate admissions. In recent years, the traditional review\nprocess has struggled to handle the overwhelmingly large amount of applicants'\ndata. Moreover, this traditional assessment often leads to human bias, which\nmight result in discrimination among applicants. Although classical machine\nlearning-based approaches exist that aim to verify the quantitative assessment\nmade by the application reviewers, these methods lack scalability and suffer\nfrom performance issues when a large volume of data is in place. In this\ncontext, we propose deep learning-based classifiers, namely Feed-Forward and\nInput Convex neural networks, which overcome the challenges faced by the\nexisting methods. Furthermore, we give additional insights into our model by\nincorporating an interpretability module, namely LIME. Our training and test\ndatasets comprise applicants' data with a wide range of variables and\ninformation. Our models achieve higher accuracy compared to the best-performing\ntraditional machine learning-based approach by a considerable margin of 3.03\\%.\nAdditionally, we show the sensitivity of different features and their relative\nimpacts on the overall admission decision using the LIME technique.\n","authors":["Amisha Priyadarshini","Barbara Martinez-Neda","Sergio Gago-Masague"],"pdf_url":"https://arxiv.org/pdf/2401.11698v1.pdf","comment":"This paper has been accepted for Transdisciplinary AI 2023 conference"},{"id":"http://arxiv.org/abs/2401.11694v1","updated":"2024-01-22T05:26:18Z","published":"2024-01-22T05:26:18Z","title":"Parametric Matrix Models","summary":" We present a general class of machine learning algorithms called parametric\nmatrix models. Parametric matrix models are based on matrix equations, and the\ndesign is motivated by the efficiency of reduced basis methods for\napproximating solutions of parametric equations. The dependent variables can be\ndefined implicitly or explicitly, and the equations may use algebraic,\ndifferential, or integral relations. Parametric matrix models can be trained\nwith empirical data only, and no high-fidelity model calculations are needed.\nWhile originally designed for scientific computing, parametric matrix models\nare universal function approximators that can be applied to general machine\nlearning problems. After introducing the underlying theory, we apply parametric\nmatrix models to a series of different challenges that show their performance\nfor a wide range of problems. For all the challenges tested here, parametric\nmatrix models produce accurate results within a computational framework that\nallows for parameter extrapolation and interpretability.\n","authors":["Patrick Cook","Danny Jammooa","Morten Hjorth-Jensen","Daniel D. 
Lee","Dean Lee"],"pdf_url":"https://arxiv.org/pdf/2401.11694v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10371v2","updated":"2024-01-22T05:24:17Z","published":"2024-01-18T20:35:47Z","title":"Langevin Unlearning: A New Perspective of Noisy Gradient Descent for\n Machine Unlearning","summary":" Machine unlearning has raised significant interest with the adoption of laws\nensuring the ``right to be forgotten''. Researchers have provided a\nprobabilistic notion of approximate unlearning under a similar definition of\nDifferential Privacy (DP), where privacy is defined as statistical\nindistinguishability to retraining from scratch. We propose Langevin\nunlearning, an unlearning framework based on noisy gradient descent with\nprivacy guarantees for approximate unlearning problems. Langevin unlearning\nunifies the DP learning process and the privacy-certified unlearning process\nwith many algorithmic benefits. These include approximate certified unlearning\nfor non-convex problems, complexity saving compared to retraining, sequential\nand batch unlearning for multiple unlearning requests. We verify the\npracticality of Langevin unlearning by studying its privacy-utility-complexity\ntrade-off via experiments on benchmark datasets, and also demonstrate its\nsuperiority against gradient-decent-plus-output-perturbation based approximate\nunlearning.\n","authors":["Eli Chien","Haoyu Wang","Ziang Chen","Pan Li"],"pdf_url":"https://arxiv.org/pdf/2401.10371v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11687v1","updated":"2024-01-22T04:54:42Z","published":"2024-01-22T04:54:42Z","title":"TIM: An Efficient Temporal Interaction Module for Spiking Transformer","summary":" Spiking Neural Networks (SNNs), as the third generation of neural networks,\nhave gained prominence for their biological plausibility and computational\nefficiency, especially in processing diverse datasets. The integration of\nattention mechanisms, inspired by advancements in neural network architectures,\nhas led to the development of Spiking Transformers. These have shown promise in\nenhancing SNNs' capabilities, particularly in the realms of both static and\nneuromorphic datasets. Despite their progress, a discernible gap exists in\nthese systems, specifically in the Spiking Self Attention (SSA) mechanism's\neffectiveness in leveraging the temporal processing potential of SNNs. To\naddress this, we introduce the Temporal Interaction Module (TIM), a novel,\nconvolution-based enhancement designed to augment the temporal data processing\nabilities within SNN architectures. TIM's integration into existing SNN\nframeworks is seamless and efficient, requiring minimal additional parameters\nwhile significantly boosting their temporal information handling capabilities.\nThrough rigorous experimentation, TIM has demonstrated its effectiveness in\nexploiting temporal information, leading to state-of-the-art performance across\nvarious neuromorphic datasets.\n","authors":["Sicheng Shen","Dongcheng Zhao","Guobin Shen","Yi Zeng"],"pdf_url":"https://arxiv.org/pdf/2401.11687v1.pdf","comment":"10pages,6figures"},{"id":"http://arxiv.org/abs/2310.03298v3","updated":"2024-01-22T04:39:36Z","published":"2023-10-05T03:56:09Z","title":"A Latent Variable Approach for Non-Hierarchical Multi-Fidelity Adaptive\n Sampling","summary":" Multi-fidelity (MF) methods are gaining popularity for enhancing surrogate\nmodeling and design optimization by incorporating data from various\nlow-fidelity (LF) models. 
While most existing MF methods assume a fixed\ndataset, adaptive sampling methods that dynamically allocate resources among\nfidelity models can achieve higher efficiency in the exploring and exploiting\nthe design space. However, most existing MF methods rely on the hierarchical\nassumption of fidelity levels or fail to capture the intercorrelation between\nmultiple fidelity levels and utilize it to quantify the value of the future\nsamples and navigate the adaptive sampling. To address this hurdle, we propose\na framework hinged on a latent embedding for different fidelity models and the\nassociated pre-posterior analysis to explicitly utilize their correlation for\nadaptive sampling. In this framework, each infill sampling iteration includes\ntwo steps: We first identify the location of interest with the greatest\npotential improvement using the high-fidelity (HF) model, then we search for\nthe next sample across all fidelity levels that maximize the improvement per\nunit cost at the location identified in the first step. This is made possible\nby a single Latent Variable Gaussian Process (LVGP) model that maps different\nfidelity models into an interpretable latent space to capture their\ncorrelations without assuming hierarchical fidelity levels. The LVGP enables us\nto assess how LF sampling candidates will affect HF response with pre-posterior\nanalysis and determine the next sample with the best benefit-to-cost ratio.\nThrough test cases, we demonstrate that the proposed method outperforms the\nbenchmark methods in both MF global fitting (GF) and Bayesian Optimization (BO)\nproblems in convergence rate and robustness. Moreover, the method offers the\nflexibility to switch between GF and BO by simply changing the acquisition\nfunction.\n","authors":["Yi-Ping Chen","Liwei Wang","Yigitcan Comlek","Wei Chen"],"pdf_url":"https://arxiv.org/pdf/2310.03298v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13158v2","updated":"2024-01-22T03:47:17Z","published":"2023-07-24T22:52:02Z","title":"Multi-UAV Speed Control with Collision Avoidance and Handover-aware Cell\n Association: DRL with Action Branching","summary":" This paper presents a deep reinforcement learning solution for optimizing\nmulti-UAV cell-association decisions and their moving velocity on a 3D aerial\nhighway. The objective is to enhance transportation and communication\nperformance, including collision avoidance, connectivity, and handovers. The\nproblem is formulated as a Markov decision process (MDP) with UAVs' states\ndefined by velocities and communication data rates. We propose a neural\narchitecture with a shared decision module and multiple network branches, each\ndedicated to a specific action dimension in a 2D transportation-communication\nspace. This design efficiently handles the multi-dimensional action space,\nallowing independence for individual action dimensions. We introduce two\nmodels, Branching Dueling Q-Network (BDQ) and Branching Dueling Double Deep\nQ-Network (Dueling DDQN), to demonstrate the approach. 
Simulation results show\na significant improvement of 18.32% compared to existing benchmarks.\n","authors":["Zijiang Yan","Wael Jaafar","Bassant Selim","Hina Tabassum"],"pdf_url":"https://arxiv.org/pdf/2307.13158v2.pdf","comment":"IEEE Globecom 2023 Accepted"},{"id":"http://arxiv.org/abs/2401.11679v1","updated":"2024-01-22T03:44:35Z","published":"2024-01-22T03:44:35Z","title":"Simulating Nighttime Visible Satellite Imagery of Tropical Cyclones\n Using Conditional Generative Adversarial Networks","summary":" Visible (VIS) imagery of satellites has various important applications in\nmeteorology, including monitoring Tropical Cyclones (TCs). However, it is\nunavailable at night because of the lack of sunlight. This study presents a\nConditional Generative Adversarial Networks (CGAN) model that generates highly\naccurate nighttime visible reflectance using infrared (IR) bands and sunlight\ndirection parameters as input. The model was trained and validated using target\narea observations of the Advanced Himawari Imager (AHI) in the daytime. This\nstudy also presents the first nighttime model validation using the Day/Night\nBand (DNB) of the Visible/Infrared Imager Radiometer Suite (VIIRS). The daytime\nstatistical results of the Structural Similarity Index Measure (SSIM), Peak\nSignal-to-Noise Ratio (PSNR), Root Mean Square Error (RMSE), Correlation\nCoefficient (CC), and Bias are 0.885, 28.3, 0.0428, 0.984, and -0.0016\nrespectively, completely surpassing the model performance of previous studies.\nThe nighttime statistical results of SSIM, PSNR, RMSE, and CC are 0.821, 24.4,\n0.0643, and 0.969 respectively, which are slightly negatively impacted by the\nparallax between satellites. We performed full-disk model validation which\nproves our model could also be readily applied in the tropical ocean without\nTCs in the northern hemisphere. This model contributes to the nighttime\nmonitoring of meteorological phenomena by providing accurate AI-generated\nvisible imagery with adjustable virtual sunlight directions.\n","authors":["Jinghuai Yao","Puyuan Du","Yucheng Zhao","Yubo Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11679v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.01841v3","updated":"2024-01-22T03:43:34Z","published":"2024-01-03T17:19:54Z","title":"Act as You Learn: Adaptive Decision-Making in Non-Stationary Markov\n Decision Processes","summary":" A fundamental (and largely open) challenge in sequential decision-making is\ndealing with non-stationary environments, where exogenous environmental\nconditions change over time. Such problems are traditionally modeled as\nnon-stationary Markov decision processes (NSMDP). However, existing approaches\nfor decision-making in NSMDPs have two major shortcomings: first, they assume\nthat the updated environmental dynamics at the current time are known (although\nfuture dynamics can change); and second, planning is largely pessimistic, i.e.,\nthe agent acts ``safely'' to account for the non-stationary evolution of the\nenvironment. We argue that both these assumptions are invalid in practice --\nupdated environmental conditions are rarely known, and as the agent interacts\nwith the environment, it can learn about the updated dynamics and avoid being\npessimistic, at least in states whose dynamics it is confident about. We\npresent a heuristic search algorithm called \\textit{Adaptive Monte Carlo Tree\nSearch (ADA-MCTS)} that addresses these challenges. 
We show that the agent can\nlearn the updated dynamics of the environment over time and then act as it\nlearns, i.e., if the agent is in a region of the state space about which it has\nupdated knowledge, it can avoid being pessimistic. To quantify ``updated\nknowledge,'' we disintegrate the aleatoric and epistemic uncertainty in the\nagent's updated belief and show how the agent can use these estimates for\ndecision-making. We compare the proposed approach with the multiple\nstate-of-the-art approaches in decision-making across multiple well-established\nopen-source problems and empirically show that our approach is faster and\nhighly adaptive without sacrificing safety.\n","authors":["Baiting Luo","Yunuo Zhang","Abhishek Dubey","Ayan Mukhopadhyay"],"pdf_url":"https://arxiv.org/pdf/2401.01841v3.pdf","comment":"Accepted for publication at the International Conference on\n Autonomous Agents and MultiAgent Systems (AAMAS), 2024"},{"id":"http://arxiv.org/abs/2401.11671v1","updated":"2024-01-22T03:09:00Z","published":"2024-01-22T03:09:00Z","title":"RTA-Former: Reverse Transformer Attention for Polyp Segmentation","summary":" Polyp segmentation is a key aspect of colorectal cancer prevention, enabling\nearly detection and guiding subsequent treatments. Intelligent diagnostic\ntools, including deep learning solutions, are widely explored to streamline and\npotentially automate this process. However, even with many powerful network\narchitectures, there still comes the problem of producing accurate edge\nsegmentation. In this paper, we introduce a novel network, namely RTA-Former,\nthat employs a transformer model as the encoder backbone and innovatively\nadapts Reverse Attention (RA) with a transformer stage in the decoder for\nenhanced edge segmentation. The results of the experiments illustrate that\nRTA-Former achieves state-of-the-art (SOTA) performance in five polyp\nsegmentation datasets. The strong capability of RTA-Former holds promise in\nimproving the accuracy of Transformer-based polyp segmentation, potentially\nleading to better clinical decisions and patient outcomes. Our code will be\npublicly available on GitHub.\n","authors":["Zhikai Li","Murong Yi","Ali Uneri","Sihan Niu","Craig Jones"],"pdf_url":"https://arxiv.org/pdf/2401.11671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11669v1","updated":"2024-01-22T03:07:24Z","published":"2024-01-22T03:07:24Z","title":"An Improved Grey Wolf Optimization Algorithm for Heart Disease\n Prediction","summary":" This paper presents a unique solution to challenges in medical image\nprocessing by incorporating an adaptive curve grey wolf optimization (ACGWO)\nalgorithm into neural network backpropagation. Neural networks show potential\nin medical data but suffer from issues like overfitting and lack of\ninterpretability due to imbalanced and scarce data. Traditional Gray Wolf\nOptimization (GWO) also has its drawbacks, such as a lack of population\ndiversity and premature convergence. This paper addresses these problems by\nintroducing an adaptive algorithm, enhancing the standard GWO with a sigmoid\nfunction. This algorithm was extensively compared to four leading algorithms\nusing six well-known test functions, outperforming them effectively. Moreover,\nby utilizing the ACGWO, we increase the robustness and generalization of the\nneural network, resulting in more interpretable predictions. 
Applied to the\npublicly accessible Cleveland Heart Disease dataset, our technique surpasses\nten other methods, achieving 86.8% accuracy, indicating its potential for\nefficient heart disease prediction in the clinical setting.\n","authors":["Sihan Niu","Yifan Zhou","Zhikai Li","Shuyao Huang","Yujun Zhou"],"pdf_url":"https://arxiv.org/pdf/2401.11669v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11667v1","updated":"2024-01-22T02:59:27Z","published":"2024-01-22T02:59:27Z","title":"INCPrompt: Task-Aware incremental Prompting for Rehearsal-Free\n Class-incremental Learning","summary":" This paper introduces INCPrompt, an innovative continual learning solution\nthat effectively addresses catastrophic forgetting. INCPrompt's key innovation\nlies in its use of adaptive key-learner and task-aware prompts that capture\ntask-relevant information. This unique combination encapsulates general\nknowledge across tasks and encodes task-specific knowledge. Our comprehensive\nevaluation across multiple continual learning benchmarks demonstrates\nINCPrompt's superiority over existing algorithms, showing its effectiveness in\nmitigating catastrophic forgetting while maintaining high performance. These\nresults highlight the significant impact of task-aware incremental prompting on\ncontinual learning performance.\n","authors":["Zhiyuan Wang","Xiaoyang Qu","Jing Xiao","Bokui Chen","Jianzong Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11667v1.pdf","comment":"Accepted by the 49th IEEE International Conference on Acoustics,\n Speech, and Signal Processing (ICASSP 2024)"},{"id":"http://arxiv.org/abs/2401.11666v1","updated":"2024-01-22T02:58:53Z","published":"2024-01-22T02:58:53Z","title":"P2DT: Mitigating Forgetting in task-incremental Learning with\n progressive prompt Decision Transformer","summary":" Catastrophic forgetting poses a substantial challenge for managing\nintelligent agents controlled by a large model, causing performance degradation\nwhen these agents face new tasks. In our work, we propose a novel solution -\nthe Progressive Prompt Decision Transformer (P2DT). This method enhances a\ntransformer-based model by dynamically appending decision tokens during new\ntask training, thus fostering task-specific policies. Our approach mitigates\nforgetting in continual and offline reinforcement learning scenarios. Moreover,\nP2DT leverages trajectories collected via traditional reinforcement learning\nfrom all tasks and generates new task-specific tokens during training, thereby\nretaining knowledge from previous studies. Preliminary results demonstrate that\nour model effectively alleviates catastrophic forgetting and scales well with\nincreasing task environments.\n","authors":["Zhiyuan Wang","Xiaoyang Qu","Jing Xiao","Bokui Chen","Jianzong Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11666v1.pdf","comment":"Accepted by the 49th IEEE International Conference on Acoustics,\n Speech, and Signal Processing (ICASSP 2024)"},{"id":"http://arxiv.org/abs/2212.00325v2","updated":"2024-01-22T02:56:53Z","published":"2022-12-01T07:19:17Z","title":"HashVFL: Defending Against Data Reconstruction Attacks in Vertical\n Federated Learning","summary":" Vertical Federated Learning (VFL) is a trending collaborative machine\nlearning model training solution. Existing industrial frameworks employ secure\nmulti-party computation techniques such as homomorphic encryption to ensure\ndata security and privacy. 
Despite these efforts, studies have revealed that\ndata leakage remains a risk in VFL due to the correlations between intermediate\nrepresentations and raw data. Neural networks can accurately capture these\ncorrelations, allowing an adversary to reconstruct the data. This emphasizes\nthe need for continued research into securing VFL systems.\n Our work shows that hashing is a promising solution to counter data\nreconstruction attacks. The one-way nature of hashing makes it difficult for an\nadversary to recover data from hash codes. However, implementing hashing in VFL\npresents new challenges, including vanishing gradients and information loss. To\naddress these issues, we propose HashVFL, which integrates hashing and\nsimultaneously achieves learnability, bit balance, and consistency.\n Experimental results indicate that HashVFL effectively maintains task\nperformance while defending against data reconstruction attacks. It also brings\nadditional benefits in reducing the degree of label leakage, mitigating\nadversarial attacks, and detecting abnormal inputs. We hope our work will\ninspire further research into the potential applications of HashVFL.\n","authors":["Pengyu Qiu","Xuhong Zhang","Shouling Ji","Chong Fu","Xing Yang","Ting Wang"],"pdf_url":"https://arxiv.org/pdf/2212.00325v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11665v1","updated":"2024-01-22T02:54:58Z","published":"2024-01-22T02:54:58Z","title":"Accelerating Approximate Thompson Sampling with Underdamped Langevin\n Monte Carlo","summary":" Approximate Thompson sampling with Langevin Monte Carlo broadens its reach\nfrom Gaussian posterior sampling to encompass more general smooth posteriors.\nHowever, it still encounters scalability issues in high-dimensional problems\nwhen demanding high accuracy. To address this, we propose an approximate\nThompson sampling strategy, utilizing underdamped Langevin Monte Carlo, where\nthe latter is the go-to workhorse for simulations of high-dimensional\nposteriors. Based on the standard smoothness and log-concavity conditions, we\nstudy the accelerated posterior concentration and sampling using a specific\npotential function. This design improves the sample complexity for realizing\nlogarithmic regrets from $\\mathcal{\\tilde O}(d)$ to $\\mathcal{\\tilde\nO}(\\sqrt{d})$. The scalability and robustness of our algorithm are also\nempirically validated through synthetic experiments in high-dimensional bandit\nproblems.\n","authors":["Haoyang Zheng","Wei Deng","Christian Moya","Guang Lin"],"pdf_url":"https://arxiv.org/pdf/2401.11665v1.pdf","comment":"50 pages, 1 figure, to appear in AISTATS 2024"},{"id":"http://arxiv.org/abs/2401.11664v1","updated":"2024-01-22T02:50:38Z","published":"2024-01-22T02:50:38Z","title":"Zero-Space Cost Fault Tolerance for Transformer-based Language Models on\n ReRAM","summary":" Resistive Random Access Memory (ReRAM) has emerged as a promising platform\nfor deep neural networks (DNNs) due to its support for parallel in-situ\nmatrix-vector multiplication. However, hardware failures, such as\nstuck-at-fault defects, can result in significant prediction errors during\nmodel inference. While additional crossbars can be used to address these\nfailures, they come with storage overhead and are not efficient in terms of\nspace, energy, and cost. In this paper, we propose a fault protection mechanism\nthat incurs zero space cost. 
Our approach includes: 1) differentiable structure\npruning of rows and columns to reduce model redundancy, 2) weight duplication\nand voting for robust output, and 3) embedding duplicated most significant bits\n(MSBs) into the model weight. We evaluate our method on nine tasks of the GLUE\nbenchmark with the BERT model, and experimental results prove its\neffectiveness.\n","authors":["Bingbing Li","Geng Yuan","Zigeng Wang","Shaoyi Huang","Hongwu Peng","Payman Behnam","Wujie Wen","Hang Liu","Caiwen Ding"],"pdf_url":"https://arxiv.org/pdf/2401.11664v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11835v2","updated":"2024-01-22T02:48:48Z","published":"2023-12-19T04:03:47Z","title":"Provably Convergent Federated Trilevel Learning","summary":" Trilevel learning, also called trilevel optimization (TLO), has been\nrecognized as a powerful modelling tool for hierarchical decision process and\nwidely applied in many machine learning applications, such as robust neural\narchitecture search, hyperparameter optimization, and domain adaptation.\nTackling TLO problems has presented a great challenge due to their nested\ndecision-making structure. In addition, existing works on TLO face the\nfollowing key challenges: 1) they all focus on the non-distributed setting,\nwhich may lead to privacy breach; 2) they do not offer any non-asymptotic\nconvergence analysis which characterizes how fast an algorithm converges. To\naddress the aforementioned challenges, this paper proposes an asynchronous\nfederated trilevel optimization method to solve TLO problems. The proposed\nmethod utilizes $\\mu$-cuts to construct a hyper-polyhedral approximation for\nthe TLO problem and solve it in an asynchronous manner. We demonstrate that the\nproposed $\\mu$-cuts are applicable to not only convex functions but also a wide\nrange of non-convex functions that meet the $\\mu$-weakly convex assumption.\nFurthermore, we theoretically analyze the non-asymptotic convergence rate for\nthe proposed method by showing its iteration complexity to obtain\n$\\epsilon$-stationary point is upper bounded by\n$\\mathcal{O}(\\frac{1}{\\epsilon^2})$. Extensive experiments on real-world\ndatasets have been conducted to elucidate the superiority of the proposed\nmethod, e.g., it has a faster convergence rate with a maximum acceleration of\napproximately 80$\\%$.\n","authors":["Yang Jiao","Kai Yang","Tiancheng Wu","Chengtao Jian","Jianwei Huang"],"pdf_url":"https://arxiv.org/pdf/2312.11835v2.pdf","comment":"Accepted at AAAI 2024"},{"id":"http://arxiv.org/abs/2305.16789v2","updated":"2024-01-22T02:47:50Z","published":"2023-05-26T09:59:48Z","title":"Modulate Your Spectrum in Self-Supervised Learning","summary":" Whitening loss offers a theoretical guarantee against feature collapse in\nself-supervised learning (SSL) with joint embedding architectures. Typically,\nit involves a hard whitening approach, transforming the embedding and applying\nloss to the whitened output. In this work, we introduce Spectral Transformation\n(ST), a framework to modulate the spectrum of embedding and to seek for\nfunctions beyond whitening that can avoid dimensional collapse. We show that\nwhitening is a special instance of ST by definition, and our empirical\ninvestigations unveil other ST instances capable of preventing collapse.\nAdditionally, we propose a novel ST instance named IterNorm with trace loss\n(INTL). 
Theoretical analysis confirms INTL's efficacy in preventing collapse\nand modulating the spectrum of embedding toward equal-eigenvalues during\noptimization. Our experiments on ImageNet classification and COCO object\ndetection demonstrate INTL's potential in learning superior representations.\nThe code is available at https://github.com/winci-ai/INTL.\n","authors":["Xi Weng","Yunhao Ni","Tengwei Song","Jie Luo","Rao Muhammad Anwer","Salman Khan","Fahad Shahbaz Khan","Lei Huang"],"pdf_url":"https://arxiv.org/pdf/2305.16789v2.pdf","comment":"Accepted at ICLR 2024. The code is available at\n https://github.com/winci-ai/intl"},{"id":"http://arxiv.org/abs/2401.11660v1","updated":"2024-01-22T02:33:38Z","published":"2024-01-22T02:33:38Z","title":"Differentiable Tree Search in Latent State Space","summary":" In decision-making problems with limited training data, policy functions\napproximated using deep neural networks often exhibit suboptimal performance.\nAn alternative approach involves learning a world model from the limited data\nand determining actions through online search. However, the performance is\nadversely affected by compounding errors arising from inaccuracies in the\nlearnt world model. While methods like TreeQN have attempted to address these\ninaccuracies by incorporating algorithmic structural biases into their\narchitectures, the biases they introduce are often weak and insufficient for\ncomplex decision-making tasks. In this work, we introduce Differentiable Tree\nSearch (DTS), a novel neural network architecture that significantly\nstrengthens the inductive bias by embedding the algorithmic structure of a\nbest-first online search algorithm. DTS employs a learnt world model to conduct\na fully differentiable online search in latent state space. The world model is\njointly optimised with the search algorithm, enabling the learning of a robust\nworld model and mitigating the effect of model inaccuracies. We address\npotential Q-function discontinuities arising from naive incorporation of\nbest-first search by adopting a stochastic tree expansion policy, formulating\nsearch tree expansion as a decision-making task, and introducing an effective\nvariance reduction technique for the gradient computation. We evaluate DTS in\nan offline-RL setting with a limited training data scenario on Procgen games\nand grid navigation task, and demonstrate that DTS outperforms popular\nmodel-free and model-based baselines.\n","authors":["Dixant Mittal","Wee Sun Lee"],"pdf_url":"https://arxiv.org/pdf/2401.11660v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06333v2","updated":"2024-01-22T02:22:12Z","published":"2023-10-10T06:03:51Z","title":"Learning bounded-degree polytrees with known skeleton","summary":" We establish finite-sample guarantees for efficient proper learning of\nbounded-degree polytrees, a rich class of high-dimensional probability\ndistributions and a subclass of Bayesian networks, a widely-studied type of\ngraphical model. Recently, Bhattacharyya et al. (2021) obtained finite-sample\nguarantees for recovering tree-structured Bayesian networks, i.e., 1-polytrees.\nWe extend their results by providing an efficient algorithm which learns\n$d$-polytrees in polynomial time and sample complexity for any bounded $d$ when\nthe underlying undirected graph (skeleton) is known. 
We complement our\nalgorithm with an information-theoretic sample complexity lower bound, showing\nthat the dependence on the dimension and target accuracy parameters are nearly\ntight.\n","authors":["Davin Choo","Joy Qiping Yang","Arnab Bhattacharyya","Clément L. Canonne"],"pdf_url":"https://arxiv.org/pdf/2310.06333v2.pdf","comment":"Fixed some typos. Added some discussions. Accepted to ALT 2024"},{"id":"http://arxiv.org/abs/2401.11652v1","updated":"2024-01-22T02:17:36Z","published":"2024-01-22T02:17:36Z","title":"OnDev-LCT: On-Device Lightweight Convolutional Transformers towards\n federated learning","summary":" Federated learning (FL) has emerged as a promising approach to\ncollaboratively train machine learning models across multiple edge devices\nwhile preserving privacy. The success of FL hinges on the efficiency of\nparticipating models and their ability to handle the unique challenges of\ndistributed learning. While several variants of Vision Transformer (ViT) have\nshown great potential as alternatives to modern convolutional neural networks\n(CNNs) for centralized training, the unprecedented size and higher\ncomputational demands hinder their deployment on resource-constrained edge\ndevices, challenging their widespread application in FL. Since client devices\nin FL typically have limited computing resources and communication bandwidth,\nmodels intended for such devices must strike a balance between model size,\ncomputational efficiency, and the ability to adapt to the diverse and non-IID\ndata distributions encountered in FL. To address these challenges, we propose\nOnDev-LCT: Lightweight Convolutional Transformers for On-Device vision tasks\nwith limited training data and resources. Our models incorporate image-specific\ninductive biases through the LCT tokenizer by leveraging efficient depthwise\nseparable convolutions in residual linear bottleneck blocks to extract local\nfeatures, while the multi-head self-attention (MHSA) mechanism in the LCT\nencoder implicitly facilitates capturing global representations of images.\nExtensive experiments on benchmark image datasets indicate that our models\noutperform existing lightweight vision models while having fewer parameters and\nlower computational demands, making them suitable for FL scenarios with data\nheterogeneity and communication bottlenecks.\n","authors":["Chu Myaet Thwal","Minh N. H. Nguyen","Ye Lin Tun","Seong Tae Kim","My T. Thai","Choong Seon Hong"],"pdf_url":"https://arxiv.org/pdf/2401.11652v1.pdf","comment":"Published in Neural Networks"},{"id":"http://arxiv.org/abs/2312.02277v2","updated":"2024-01-22T02:03:50Z","published":"2023-12-04T19:00:07Z","title":"ALEXR: An Optimal Single-Loop Algorithm for Convex Finite-Sum Coupled\n Compositional Stochastic Optimization","summary":" This paper revisits a class of convex Finite-Sum Coupled Compositional\nStochastic Optimization (cFCCO) problems with many applications, including\ngroup distributionally robust optimization (GDRO), learning with imbalanced\ndata, reinforcement learning, and learning to rank. To better solve these\nproblems, we introduce an efficient single-loop primal-dual block-coordinate\nproximal algorithm, dubbed ALEXR. This algorithm leverages block-coordinate\nstochastic mirror ascent updates for the dual variable and stochastic proximal\ngradient descent updates for the primal variable. 
We establish the convergence\nrates of ALEXR in both convex and strongly convex cases under smoothness and\nnon-smoothness conditions of involved functions, which not only improve the\nbest rates in previous works on smooth cFCCO problems but also expand the realm\nof cFCCO for solving more challenging non-smooth problems such as the dual form\nof GDRO. Finally, we present lower complexity bounds to demonstrate that the\nconvergence rates of ALEXR are optimal among first-order block-coordinate\nstochastic algorithms for the considered class of cFCCO problems.\n","authors":["Bokun Wang","Tianbao Yang"],"pdf_url":"https://arxiv.org/pdf/2312.02277v2.pdf","comment":"Fixed several typos; Added some numerical experiments"},{"id":"http://arxiv.org/abs/2401.11648v1","updated":"2024-01-22T01:58:32Z","published":"2024-01-22T01:58:32Z","title":"Next Visit Diagnosis Prediction via Medical Code-Centric Multimodal\n Contrastive EHR Modelling with Hierarchical Regularisation","summary":" Predicting next visit diagnosis using Electronic Health Records (EHR) is an\nessential task in healthcare, critical for devising proactive future plans for\nboth healthcare providers and patients. Nonetheless, many preceding studies\nhave not sufficiently addressed the heterogeneous and hierarchical\ncharacteristics inherent in EHR data, inevitably leading to sub-optimal\nperformance. To this end, we propose NECHO, a novel medical code-centric\nmultimodal contrastive EHR learning framework with hierarchical regularisation.\nFirst, we integrate multifaceted information encompassing medical codes,\ndemographics, and clinical notes using a tailored network design and a pair of\nbimodal contrastive losses, all of which pivot around a medical code\nrepresentation. We also regularise modality-specific encoders using a parental\nlevel information in medical ontology to learn hierarchical structure of EHR\ndata. A series of experiments on MIMIC-III data demonstrates effectiveness of\nour approach.\n","authors":["Heejoon Koo"],"pdf_url":"https://arxiv.org/pdf/2401.11648v1.pdf","comment":"Accepted to EACL 2024 (The 18th Conference of the European Chapter of\n the Association for Computational Linguistics)"},{"id":"http://arxiv.org/abs/2401.11647v1","updated":"2024-01-22T01:57:31Z","published":"2024-01-22T01:57:31Z","title":"LW-FedSSL: Resource-efficient Layer-wise Federated Self-supervised\n Learning","summary":" Many recent studies integrate federated learning (FL) with self-supervised\nlearning (SSL) to take advantage of raw training data distributed across edge\ndevices. However, edge devices often struggle with high computation and\ncommunication costs imposed by SSL and FL algorithms. To tackle this hindrance,\nwe propose LW-FedSSL, a layer-wise federated self-supervised learning approach\nthat allows edge devices to incrementally train one layer of the model at a\ntime. LW-FedSSL comprises server-side calibration and representation alignment\nmechanisms to maintain comparable performance with end-to-end FedSSL while\nsignificantly lowering clients' resource requirements. The server-side\ncalibration mechanism takes advantage of the resource-rich server in an FL\nenvironment to assist in global model training. Meanwhile, the representation\nalignment mechanism encourages closeness between representations of FL local\nmodels and those of the global model. Our experiments show that LW-FedSSL has a\n$3.3 \\times$ lower memory requirement and a $3.2 \\times$ cheaper communication\ncost than its end-to-end counterpart. 
We also explore a progressive training\nstrategy called Prog-FedSSL that outperforms end-to-end training with a similar\nmemory requirement and a $1.8 \\times$ cheaper communication cost.\n","authors":["Ye Lin Tun","Chu Myaet Thwal","Le Quang Huy","Minh N. H. Nguyen","Choong Seon Hong"],"pdf_url":"https://arxiv.org/pdf/2401.11647v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11646v1","updated":"2024-01-22T01:45:34Z","published":"2024-01-22T01:45:34Z","title":"Nonparametric Estimation via Variance-Reduced Sketching","summary":" Nonparametric models are of great interest in various scientific and\nengineering disciplines. Classical kernel methods, while numerically robust and\nstatistically sound in low-dimensional settings, become inadequate in\nhigher-dimensional settings due to the curse of dimensionality. In this paper,\nwe introduce a new framework called Variance-Reduced Sketching (VRS),\nspecifically designed to estimate density functions and nonparametric\nregression functions in higher dimensions with a reduced curse of\ndimensionality. Our framework conceptualizes multivariable functions as\ninfinite-size matrices, and facilitates a new sketching technique motivated by\nnumerical linear algebra literature to reduce the variance in estimation\nproblems. We demonstrate the robust numerical performance of VRS through a\nseries of simulated experiments and real-world data applications. Notably, VRS\nshows remarkable improvement over existing neural network estimators and\nclassical kernel methods in numerous density estimation and nonparametric\nregression models. Additionally, we offer theoretical justifications for VRS to\nsupport its ability to deliver nonparametric estimation with a reduced curse of\ndimensionality.\n","authors":["Yuehaw Khoo","Yifan Peng","Daren Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11646v1.pdf","comment":"64 pages, 8 figures"},{"id":"http://arxiv.org/abs/2312.16113v2","updated":"2024-01-22T01:38:12Z","published":"2023-12-20T08:16:53Z","title":"Task-Driven Causal Feature Distillation: Towards Trustworthy Risk\n Prediction","summary":" Since artificial intelligence has seen tremendous recent successes in many\nareas, it has sparked great interest in its potential for trustworthy and\ninterpretable risk prediction. However, most models lack causal reasoning and\nstruggle with class imbalance, leading to poor precision and recall. To address\nthis, we propose a Task-Driven Causal Feature Distillation model (TDCFD) to\ntransform original feature values into causal feature attributions for the\nspecific risk prediction task. The causal feature attribution helps describe\nhow much contribution the value of this feature can make to the risk prediction\nresult. After the causal feature distillation, a deep neural network is applied\nto produce trustworthy prediction results with causal interpretability and high\nprecision/recall. 
We evaluate the performance of our TDCFD method on several\nsynthetic and real datasets, and the results demonstrate its superiority over\nthe state-of-the-art methods regarding precision, recall, interpretability, and\ncausality.\n","authors":["Zhixuan Chu","Mengxuan Hu","Qing Cui","Longfei Li","Sheng Li"],"pdf_url":"https://arxiv.org/pdf/2312.16113v2.pdf","comment":"Proceedings of the 2024 AAAI Conference on Artificial Intelligence"},{"id":"http://arxiv.org/abs/2109.01636v4","updated":"2024-01-22T01:23:23Z","published":"2021-09-03T17:28:04Z","title":"Empirical Study of Named Entity Recognition Performance Using\n Distribution-aware Word Embedding","summary":" With the fast development of Deep Learning techniques, Named Entity\nRecognition (NER) is becoming more and more important in the information\nextraction task. The greatest difficulty that the NER task faces is to keep the\ndetectability even when types of NE and documents are unfamiliar. Realizing\nthat the specificity information may contain potential meanings of a word and\ngenerate semantic-related features for word embedding, we develop a\ndistribution-aware word embedding and implement three different methods to make\nuse of the distribution information in a NER framework. And the result shows\nthat the performance of NER will be improved if the word specificity is\nincorporated into existing NER methods.\n","authors":["Xin Chen","Qi Zhao","Xinyang Liu"],"pdf_url":"https://arxiv.org/pdf/2109.01636v4.pdf","comment":"Want to correct"},{"id":"http://arxiv.org/abs/2401.01084v2","updated":"2024-01-22T01:16:24Z","published":"2024-01-02T07:56:17Z","title":"Global Convergence of Natural Policy Gradient with Hessian-aided\n Momentum Variance Reduction","summary":" Natural policy gradient (NPG) and its variants are widely-used policy search\nmethods in reinforcement learning. Inspired by prior work, a new NPG variant\ncoined NPG-HM is developed in this paper, which utilizes the Hessian-aided\nmomentum technique for variance reduction, while the sub-problem is solved via\nthe stochastic gradient descent method. It is shown that NPG-HM can achieve the\nglobal last iterate $\\epsilon$-optimality with a sample complexity of\n$\\mathcal{O}(\\epsilon^{-2})$, which is the best known result for natural policy\ngradient type methods under the generic Fisher non-degenerate policy\nparameterizations. The convergence analysis is built upon a relaxed weak\ngradient dominance property tailored for NPG under the compatible function\napproximation framework, as well as a neat way to decompose the error when\nhandling the sub-problem. Moreover, numerical experiments on Mujoco-based\nenvironments demonstrate the superior performance of NPG-HM over other\nstate-of-the-art policy gradient methods.\n","authors":["Jie Feng","Ke Wei","Jinchi Chen"],"pdf_url":"https://arxiv.org/pdf/2401.01084v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.17778v3","updated":"2024-01-22T00:54:30Z","published":"2023-06-30T16:31:14Z","title":"Look, Remember and Reason: Grounded reasoning in videos with language\n models","summary":" Multi-modal language models (LM) have recently shown promising performance in\nhigh-level reasoning tasks on videos. However, existing methods still fall\nshort in tasks like causal or compositional spatiotemporal reasoning over\nactions, in which model predictions need to be grounded in fine-grained\nlow-level details, such as object motions and object interactions. 
In this\nwork, we propose training an LM end-to-end on low-level surrogate tasks,\nincluding object detection, re-identification, and tracking, to endow the model\nwith the required low-level visual capabilities. We show that a two-stream\nvideo encoder with spatiotemporal attention is effective at capturing the\nrequired static and motion-based cues in the video. By leveraging the LM's\nability to perform the low-level surrogate tasks, we can cast reasoning in\nvideos as the three-step process of Look, Remember, Reason wherein visual\ninformation is extracted using low-level visual skills step-by-step and then\nintegrated to arrive at a final answer. We demonstrate the effectiveness of our\nframework on diverse visual reasoning tasks from the ACRE, CATER,\nSomething-Else and STAR datasets. Our approach is trainable end-to-end and\nsurpasses state-of-the-art task-specific methods across these tasks by a large\nmargin.\n","authors":["Apratim Bhattacharyya","Sunny Panchal","Mingu Lee","Reza Pourreza","Pulkit Madan","Roland Memisevic"],"pdf_url":"https://arxiv.org/pdf/2306.17778v3.pdf","comment":"To appear at ICLR 2024"},{"id":"http://arxiv.org/abs/2306.09136v3","updated":"2024-01-22T00:51:05Z","published":"2023-06-15T13:49:30Z","title":"Finite-Time Logarithmic Bayes Regret Upper Bounds","summary":" We derive the first finite-time logarithmic Bayes regret upper bounds for\nBayesian bandits. In a multi-armed bandit, we obtain $O(c_\\Delta \\log n)$ and\n$O(c_h \\log^2 n)$ upper bounds for an upper confidence bound algorithm, where\n$c_h$ and $c_\\Delta$ are constants depending on the prior distribution and the\ngaps of bandit instances sampled from it, respectively. The latter bound\nasymptotically matches the lower bound of Lai (1987). Our proofs are a major\ntechnical departure from prior works, while being simple and general. To show\nthe generality of our techniques, we apply them to linear bandits. Our results\nprovide insights on the value of prior in the Bayesian setting, both in the\nobjective and as a side information given to the learner. They significantly\nimprove upon existing $\\tilde{O}(\\sqrt{n})$ bounds, which have become standard\nin the literature despite the logarithmic lower bound of Lai (1987).\n","authors":["Alexia Atsidakou","Branislav Kveton","Sumeet Katariya","Constantine Caramanis","Sujay Sanghavi"],"pdf_url":"https://arxiv.org/pdf/2306.09136v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13118v2","updated":"2024-01-22T00:50:55Z","published":"2023-12-20T15:37:50Z","title":"LRS: Enhancing Adversarial Transferability through Lipschitz Regularized\n Surrogate","summary":" The transferability of adversarial examples is of central importance to\ntransfer-based black-box adversarial attacks. Previous works for generating\ntransferable adversarial examples focus on attacking \\emph{given} pretrained\nsurrogate models while the connections between surrogate models and adversarial\ntrasferability have been overlooked. In this paper, we propose {\\em Lipschitz\nRegularized Surrogate} (LRS) for transfer-based black-box attacks, a novel\napproach that transforms surrogate models towards favorable adversarial\ntransferability. Using such transformed surrogate models, any existing\ntransfer-based black-box attack can run without any change, yet achieving much\nbetter performance. 
Specifically, we impose Lipschitz regularization on the\nloss landscape of surrogate models to enable a smoother and more controlled\noptimization process for generating more transferable adversarial examples. In\naddition, this paper also sheds light on the connection between the inner\nproperties of surrogate models and adversarial transferability, where three\nfactors are identified: smaller local Lipschitz constant, smoother loss\nlandscape, and stronger adversarial robustness. We evaluate our proposed LRS\napproach by attacking state-of-the-art standard deep neural networks and\ndefense models. The results demonstrate significant improvement on the attack\nsuccess rates and transferability. Our code is available at\nhttps://github.com/TrustAIoT/LRS.\n","authors":["Tao Wu","Tie Luo","Donald C. Wunsch"],"pdf_url":"https://arxiv.org/pdf/2312.13118v2.pdf","comment":"AAAI 2024 main track. Code available on Github (see abstract).\n Appendix is included in this updated version"},{"id":"http://arxiv.org/abs/2206.14358v2","updated":"2024-01-22T00:38:08Z","published":"2022-06-29T01:57:44Z","title":"Using Twitter Data to Understand Public Perceptions of Approved versus\n Off-label Use for COVID-19-related Medications","summary":" Understanding public discourse on emergency use of unproven therapeutics is\ncrucial for monitoring safe use and combating misinformation. We developed a\nnatural language processing-based pipeline to comprehend public perceptions of\nand stances on coronavirus disease 2019 (COVID-19)-related drugs on Twitter\nover time. This retrospective study included 609,189 US-based tweets from\nJanuary 29, 2020, to November 30, 2021, about four drugs that garnered\nsignificant public attention during the COVID-19 pandemic: (1)\nHydroxychloroquine and Ivermectin, therapies with anecdotal evidence; and (2)\nMolnupiravir and Remdesivir, FDA-approved treatments for eligible patients.\nTime-trend analysis was employed to understand popularity trends and related\nevents. Content and demographic analyses were conducted to explore potential\nrationales behind people's stances on each drug. Time-trend analysis indicated\nthat Hydroxychloroquine and Ivermectin were discussed more than Molnupiravir\nand Remdesivir, particularly during COVID-19 surges. Hydroxychloroquine and\nIvermectin discussions were highly politicized, related to conspiracy theories,\nhearsay, and celebrity influences. The distribution of stances between the two\nmajor US political parties was significantly different (P < .001); Republicans\nwere more likely to support Hydroxychloroquine (55%) and Ivermectin (30%) than\nDemocrats. People with healthcare backgrounds tended to oppose\nHydroxychloroquine (7%) more than the general population, while the general\npopulation was more likely to support Ivermectin (14%). Our study found that\nsocial media users have varying perceptions and stances on off-label versus\nFDA-authorized drug use at different stages of COVID-19. This indicates that\nhealth systems, regulatory agencies, and policymakers should design tailored\nstrategies to monitor and reduce misinformation to promote safe drug use.\n","authors":["Yining Hua","Hang Jiang","Shixu Lin","Jie Yang","Joseph M. Plasek","David W. 
Bates","Li Zhou"],"pdf_url":"https://arxiv.org/pdf/2206.14358v2.pdf","comment":"Full paper published in JAMIA"},{"id":"http://arxiv.org/abs/2310.17168v2","updated":"2024-01-22T00:12:20Z","published":"2023-10-26T05:49:13Z","title":"Learning an Inventory Control Policy with General Inventory Arrival\n Dynamics","summary":" In this paper we address the problem of learning and backtesting inventory\ncontrol policies in the presence of general arrival dynamics -- which we term\nas a quantity-over-time arrivals model (QOT). We also allow for order\nquantities to be modified as a post-processing step to meet vendor constraints\nsuch as order minimum and batch size constraints -- a common practice in real\nsupply chains. To the best of our knowledge this is the first work to handle\neither arbitrary arrival dynamics or an arbitrary downstream post-processing of\norder quantities. Building upon recent work (Madeka et al., 2022) we similarly\nformulate the periodic review inventory control problem as an exogenous\ndecision process, where most of the state is outside the control of the agent.\nMadeka et al., 2022 show how to construct a simulator that replays historic\ndata to solve this class of problem. In our case, we incorporate a deep\ngenerative model for the arrivals process as part of the history replay. By\nformulating the problem as an exogenous decision process, we can apply results\nfrom Madeka et al., 2022 to obtain a reduction to supervised learning. Via\nsimulation studies we show that this approach yields statistically significant\nimprovements in profitability over production baselines. Using data from a\nreal-world A/B test, we show that Gen-QOT generalizes well to off-policy data\nand that the resulting buying policy outperforms traditional inventory\nmanagement systems in real world settings.\n","authors":["Sohrab Andaz","Carson Eisenach","Dhruv Madeka","Kari Torkkola","Randy Jia","Dean Foster","Sham Kakade"],"pdf_url":"https://arxiv.org/pdf/2310.17168v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2310.00647v2","updated":"2024-01-22T18:53:48Z","published":"2023-10-01T12:02:59Z","title":"Beyond Task Performance: Evaluating and Reducing the Flaws of Large\n Multimodal Models with In-Context Learning","summary":" Following the success of Large Language Models (LLMs), Large Multimodal\nModels (LMMs), such as the Flamingo model and its subsequent competitors, have\nstarted to emerge as natural steps towards generalist agents. However,\ninteracting with recent LMMs reveals major limitations that are hardly captured\nby the current evaluation benchmarks. Indeed, task performances (e.g., VQA\naccuracy) alone do not provide enough clues to understand their real\ncapabilities, limitations, and to which extent such models are aligned to human\nexpectations. To refine our understanding of those flaws, we deviate from the\ncurrent evaluation paradigm, and (1) evaluate 10 recent open-source LMMs from\n3B up to 80B parameter scale, on 5 different axes; hallucinations, abstention,\ncompositionality, explainability and instruction following. Our evaluation on\nthese axes reveals major flaws in LMMs. While the current go-to solution to\nalign these models is based on training, such as instruction tuning or RLHF, we\nrather (2) explore the training-free in-context learning (ICL) as a solution,\nand study how it affects these limitations. 
Based on our ICL study, (3) we push\nICL further and propose new multimodal ICL variants such as; Multitask-ICL,\nChain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows.\n(1) Despite their success, LMMs have flaws that remain unsolved with scaling\nalone. (2) The effect of ICL on LMMs flaws is nuanced; despite its\neffectiveness for improved explainability, answer abstention, ICL only slightly\nimproves instruction following, does not improve compositional abilities, and\nactually even amplifies hallucinations. (3) The proposed ICL variants are\npromising as post-hoc approaches to efficiently tackle some of those flaws. The\ncode is available here: https://github.com/mshukor/EvALign-ICL.\n","authors":["Mustafa Shukor","Alexandre Rame","Corentin Dancette","Matthieu Cord"],"pdf_url":"https://arxiv.org/pdf/2310.00647v2.pdf","comment":"ICLR 2024. Project Page: https://evalign-icl.github.io/"},{"id":"http://arxiv.org/abs/2401.11943v1","updated":"2024-01-22T13:33:53Z","published":"2024-01-22T13:33:53Z","title":"Benchmarking Large Multimodal Models against Common Corruptions","summary":" This technical report aims to fill a deficiency in the assessment of large\nmultimodal models (LMMs) by specifically examining the self-consistency of\ntheir outputs when subjected to common corruptions. We investigate the\ncross-modal interactions between text, image, and speech, encompassing four\nessential generation tasks: text-to-image, image-to-text, text-to-speech, and\nspeech-to-text. We create a comprehensive benchmark, named MMCBench, that\ncovers more than 100 popular LMMs (totally over 150 model checkpoints). A\nthorough evaluation under common corruptions is critical for practical\ndeployment and facilitates a better understanding of the reliability of\ncutting-edge LMMs. The benchmarking code is available at\nhttps://github.com/sail-sg/MMCBench\n","authors":["Jiawei Zhang","Tianyu Pang","Chao Du","Yi Ren","Bo Li","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2401.11943v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2401.11818v1","updated":"2024-01-22T10:26:52Z","published":"2024-01-22T10:26:52Z","title":"MInD: Improving Multimodal Sentiment Analysis via Multimodal Information\n Disentanglement","summary":" Learning effective joint representations has been a central task in\nmultimodal sentiment analysis. Previous methods focus on leveraging the\ncorrelations between different modalities and enhancing performance through\nsophisticated fusion techniques. However, challenges still exist due to the\ninherent heterogeneity of distinct modalities, which may lead to distributional\ngap, impeding the full exploitation of inter-modal information and resulting in\nredundancy and impurity in the information extracted from features. To address\nthis problem, we introduce the Multimodal Information Disentanglement (MInD)\napproach. MInD decomposes the multimodal inputs into a modality-invariant\ncomponent, a modality-specific component, and a remnant noise component for\neach modality through a shared encoder and multiple private encoders. The\nshared encoder aims to explore the shared information and commonality across\nmodalities, while the private encoders are deployed to capture the distinctive\ninformation and characteristic features. These representations thus furnish a\ncomprehensive perspective of the multimodal data, facilitating the fusion\nprocess instrumental for subsequent prediction tasks. 
Furthermore, MInD\nimproves the learned representations by explicitly modeling the task-irrelevant\nnoise in an adversarial manner. Experimental evaluations conducted on benchmark\ndatasets, including CMU-MOSI, CMU-MOSEI, and UR-Funny, demonstrate MInD's\nsuperior performance over existing state-of-the-art methods in both multimodal\nemotion recognition and multimodal humor detection tasks.\n","authors":["Weichen Dai","Xingyu Li","Pengbo Hu","Zeyu Wang","Ji Qi","Jianlin Peng","Yi Zhou"],"pdf_url":"https://arxiv.org/pdf/2401.11818v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11764v1","updated":"2024-01-22T08:59:09Z","published":"2024-01-22T08:59:09Z","title":"Identity-Driven Multimedia Forgery Detection via Reference Assistance","summary":" Recent advancements in technologies, such as the 'deepfake' technique, have\npaved the way for the generation of various media forgeries. In response to the\npotential hazards of these media forgeries, many researchers engage in\nexploring detection methods, increasing the demand for high-quality media\nforgery datasets. Despite this, existing datasets have certain limitations.\nFirstly, most of datasets focus on the manipulation of visual modality and\nusually lack diversity, as only a few forgery approaches are considered.\nSecondly, the quality of media is often inadequate in clarity and naturalness.\nMeanwhile, the size of the dataset is also limited. Thirdly, while many\nreal-world forgeries are driven by identity, the identity information of the\nsubject in media is frequently neglected. For detection, identity information\ncould be an essential clue to boost accuracy. Moreover, official media\nconcerning certain identities on the Internet can serve as prior knowledge,\naiding both the audience and forgery detectors in determining the true\nidentity. Therefore, we propose an identity-driven multimedia forgery dataset,\nIDForge, which contains 249,138 video shots. All video shots are sourced from\n324 wild videos collected of 54 celebrities from the Internet. The fake video\nshots involve 9 types of manipulation across visual, audio and textual\nmodalities. Additionally, IDForge provides extra 214,438 real video shots as a\nreference set for the 54 celebrities. Correspondingly, we design an effective\nmultimedia detection network, Reference-assisted Multimodal Forgery Detection\nNetwork (R-MFDN). Through extensive experiments on the proposed dataset, we\ndemonstrate the effectiveness of R-MFDN on the multimedia detection task.\n","authors":["Junhao Xu","Jingjing Chen","Xue Song","Feng Han","Haijun Shan","Yugang Jiang"],"pdf_url":"https://arxiv.org/pdf/2401.11764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12264v1","updated":"2024-01-22T08:16:48Z","published":"2024-01-22T08:16:48Z","title":"CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model\n for Multimodal Processing","summary":" There has been a long-standing quest for a unified audio-visual-text model to\nenable various multimodal understanding tasks, which mimics the listening,\nseeing and reading process of human beings. Humans tends to represent knowledge\nusing two separate systems: one for representing verbal (textual) information\nand one for representing non-verbal (visual and auditory) information. 
These\ntwo systems can operate independently but can also interact with each other.\nMotivated by this understanding of human cognition, in this paper, we introduce\nCoAVT -- a novel cognition-inspired Correlated Audio-Visual-Text pre-training\nmodel to connect the three modalities. It contains a joint audio-visual encoder\nthat learns to encode audio-visual synchronization information together with\nthe audio and visual content for non-verbal information, and a text encoder to\nhandle textual input for verbal information. To bridge the gap between\nmodalities, CoAVT employs a query encoder, which contains a set of learnable\nquery embeddings, and extracts the most informative audiovisual features of the\ncorresponding text. Additionally, to leverage the correspondences between audio\nand vision with language respectively, we also establish the audio-text and\nvisual-text bi-modal alignments upon the foundational audiovisual-text\ntri-modal alignment to enhance the multimodal representation learning. Finally,\nwe jointly optimize CoAVT model with three multimodal objectives: contrastive\nloss, matching loss and language modeling loss. Extensive experiments show that\nCoAVT can learn strong multimodal correlations and be generalized to various\ndownstream tasks. CoAVT establishes new state-of-the-art performance on\ntext-video retrieval task on AudioCaps for both zero-shot and fine-tuning\nsettings, audio-visual event classification and audio-visual retrieval tasks on\nAudioSet and VGGSound.\n","authors":["Xianghu Yue","Xiaohai Tian","Malu Zhang","Zhizheng Wu","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2401.12264v1.pdf","comment":null}]},"2024-01-21T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2401.11631v1","updated":"2024-01-21T23:54:05Z","published":"2024-01-21T23:54:05Z","title":"Text-to-Image Cross-Modal Generation: A Systematic Review","summary":" We review research on generating visual data from text from the angle of\n\"cross-modal generation.\" This point of view allows us to draw parallels\nbetween various methods geared towards working on input text and producing\nvisual output, without limiting the analysis to narrow sub-areas. It also\nresults in the identification of common templates in the field, which are then\ncompared and contrasted both within pools of similar methods and across lines\nof research. We provide a breakdown of text-to-image generation into various\nflavors of image-from-text methods, video-from-text methods, image editing,\nself-supervised and graph-based approaches. In this discussion, we focus on\nresearch papers published at 8 leading machine learning conferences in the\nyears 2016-2022, also incorporating a number of relevant papers not matching\nthe outlined search criteria. The conducted review suggests a significant\nincrease in the number of papers published in the area and highlights research\ngaps and potential lines of investigation. To our knowledge, this is the first\nreview to systematically look at text-to-image generation from the perspective\nof \"cross-modal generation.\"\n","authors":["Maciej Żelaszczyk","Jacek Mańdziuk"],"pdf_url":"https://arxiv.org/pdf/2401.11631v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11626v1","updated":"2024-01-21T23:37:33Z","published":"2024-01-21T23:37:33Z","title":"Freely Long-Thinking Transformer (FraiLT)","summary":" Freely Long-Thinking Transformer (FraiLT) is an improved transformer model\ndesigned to enhance processing capabilities without scaling up size. 
It\nutilizes a recursive approach, iterating over a subset of layers multiple\ntimes, and introduces iteration encodings to maintain awareness across these\ncycles. Iteration encoding allows FraiLT to achieve the interpretive depth of\nlarger models in a compact form. When evaluated on a synthetic story dataset,\nFraiLT outperformed larger models, showcasing its ability to deliver\nhigh-quality performance while reducing memory demands. This model represents a\nstep forward towards more efficient and accessible language models.\n","authors":["Akbay Tabak"],"pdf_url":"https://arxiv.org/pdf/2401.11626v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11624v1","updated":"2024-01-21T23:34:42Z","published":"2024-01-21T23:34:42Z","title":"In-context Learning with Retrieved Demonstrations for Language Models: A\n Survey","summary":" Language models, especially pre-trained large language models, have showcased\nremarkable abilities as few-shot in-context learners (ICL), adept at adapting\nto new tasks with just a few demonstrations in the input context. However, the\nmodel's ability to perform ICL is sensitive to the choice of the few-shot\ndemonstrations. Instead of using a fixed set of demonstrations, one recent\ndevelopment is to retrieve demonstrations tailored to each input query. The\nimplementation of demonstration retrieval is relatively straightforward,\nleveraging existing databases and retrieval systems. This not only improves the\nefficiency and scalability of the learning process but also has been shown to\nreduce biases inherent in manual example selection. In light of the encouraging\nresults and growing research in ICL with retrieved demonstrations, we conduct\nan extensive review of studies in this area. In this survey, we discuss and\ncompare different design choices for retrieval models, retrieval training\nprocedures, and inference algorithms.\n","authors":["an Luo","Xin Xu","Yue Liu","Panupong Pasupat","Mehran Kazemi"],"pdf_url":"https://arxiv.org/pdf/2401.11624v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11601v1","updated":"2024-01-21T21:21:51Z","published":"2024-01-21T21:21:51Z","title":"Robust Evaluation Measures for Evaluating Social Biases in Masked\n Language Models","summary":" Many evaluation measures are used to evaluate social biases in masked\nlanguage models (MLMs). However, we find that these previously proposed\nevaluation measures are lacking robustness in scenarios with limited datasets.\nThis is because these measures are obtained by comparing the\npseudo-log-likelihood (PLL) scores of the stereotypical and anti-stereotypical\nsamples using an indicator function. The disadvantage is the limited mining of\nthe PLL score sets without capturing its distributional information. In this\npaper, we represent a PLL score set as a Gaussian distribution and use Kullback\nLeibler (KL) divergence and Jensen Shannon (JS) divergence to construct\nevaluation measures for the distributions of stereotypical and\nanti-stereotypical PLL scores. 
Experimental results on the publicly available\ndatasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures\nare significantly more robust and interpretable than those proposed previously.\n","authors":["Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2401.11601v1.pdf","comment":"9 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.01361v2","updated":"2024-01-21T21:01:12Z","published":"2023-10-02T17:23:48Z","title":"GenSim: Generating Robotic Simulation Tasks via Large Language Models","summary":" Collecting large amounts of real-world interaction data to train general\nrobotic policies is often prohibitively expensive, thus motivating the use of\nsimulation data. However, existing methods for data generation have generally\nfocused on scene-level diversity (e.g., object instances and poses) rather than\ntask-level diversity, due to the human effort required to come up with and\nverify novel tasks. This has made it challenging for policies trained on\nsimulation data to demonstrate significant task-level generalization. In this\npaper, we propose to automatically generate rich simulation environments and\nexpert demonstrations by exploiting a large language models' (LLM) grounding\nand coding ability. Our approach, dubbed GenSim, has two modes: goal-directed\ngeneration, wherein a target task is given to the LLM and the LLM proposes a\ntask curriculum to solve the target task, and exploratory generation, wherein\nthe LLM bootstraps from previous tasks and iteratively proposes novel tasks\nthat would be helpful in solving more complex tasks. We use GPT4 to expand the\nexisting benchmark by ten times to over 100 tasks, on which we conduct\nsupervised finetuning and evaluate several LLMs including finetuned GPTs and\nCode Llama on code generation for robotic simulation tasks. Furthermore, we\nobserve that LLMs-generated simulation programs can enhance task-level\ngeneralization significantly when used for multitask policy training. We\nfurther find that with minimal sim-to-real adaptation, the multitask policies\npretrained on GPT4-generated simulation tasks exhibit stronger transfer to\nunseen long-horizon tasks in the real world and outperform baselines by 25%.\nSee the project website (https://liruiw.github.io/gensim) for code, demos, and\nvideos.\n","authors":["Lirui Wang","Yiyang Ling","Zhecheng Yuan","Mohit Shridhar","Chen Bao","Yuzhe Qin","Bailin Wang","Huazhe Xu","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.01361v2.pdf","comment":"See our project website (https://liruiw.github.io/gensim), demo and\n datasets (https://huggingface.co/spaces/Gen-Sim/Gen-Sim), and code\n (https://github.com/liruiw/GenSim) for more details"},{"id":"http://arxiv.org/abs/2309.12244v2","updated":"2024-01-21T16:30:35Z","published":"2023-09-21T16:43:17Z","title":"ChaCha: Leveraging Large Language Models to Prompt Children to Share\n Their Emotions about Personal Events","summary":" Children typically learn to identify and express emotions through sharing\ntheir stories and feelings with others, particularly their family. However, it\nis challenging for parents or siblings to have emotional communication with\nchildren since children are still developing their communication skills. We\npresent ChaCha, a chatbot that encourages and guides children to share personal\nevents and associated emotions. ChaCha combines a state machine and large\nlanguage models (LLMs) to keep the dialogue on track while carrying on\nfree-form conversations. 
Through an exploratory study with 20 children (aged\n8-12), we examine how ChaCha prompts children to share personal events and\nguides them to describe associated emotions. Participants perceived ChaCha as a\nclose friend and shared their stories on various topics, such as family trips\nand personal achievements. Based on the findings, we discuss opportunities for\nleveraging LLMs to design child-friendly chatbots to support children in\nsharing emotions.\n","authors":["Woosuk Seo","Chanmo Yang","Young-Ho Kim"],"pdf_url":"https://arxiv.org/pdf/2309.12244v2.pdf","comment":"16 pages, 5 figures, 2 tables; Accepted at ACM CHI 2024"},{"id":"http://arxiv.org/abs/2401.09074v2","updated":"2024-01-21T15:15:30Z","published":"2024-01-17T09:23:59Z","title":"Code Simulation Challenges for Large Language Models","summary":" We investigate the extent to which Large Language Models (LLMs) can simulate\nthe execution of computer code and algorithms. We begin by looking at straight\nline programs, and show that current LLMs demonstrate poor performance even\nwith such simple programs -- performance rapidly degrades with the length of\ncode. We then investigate the ability of LLMs to simulate programs that contain\ncritical paths and redundant instructions. We also go beyond straight line\nprogram simulation with sorting algorithms and nested loops, and we show the\ncomputational complexity of a routine directly affects the ability of an LLM to\nsimulate its execution. We observe that LLMs execute instructions sequentially\nand with a low error margin only for short programs or standard procedures.\nLLMs' code simulation is in tension with their pattern recognition and\nmemorisation capabilities: on tasks where memorisation is detrimental, we\npropose a novel prompting method to simulate code execution line by line.\nEmpirically, our new Chain of Simulation (CoSm) method improves on the standard\nChain of Thought prompting approach by avoiding the pitfalls of memorisation.\n","authors":["Emanuele La Malfa","Christoph Weinhuber","Orazio Torre","Fangru Lin","Anthony Cohn","Nigel Shadbolt","Michael Wooldridge"],"pdf_url":"https://arxiv.org/pdf/2401.09074v2.pdf","comment":"main paper (10 pages) + Appendix (11 pages)"},{"id":"http://arxiv.org/abs/2302.12584v2","updated":"2024-01-21T14:51:26Z","published":"2023-02-24T11:44:24Z","title":"VivesDebate-Speech: A Corpus of Spoken Argumentation to Leverage Audio\n Features for Argument Mining","summary":" In this paper, we describe VivesDebate-Speech, a corpus of spoken\nargumentation created to leverage audio features for argument mining tasks. The\ncreation of this corpus represents an important contribution to the\nintersection of speech processing and argument mining communities, and one of\nthe most complete publicly available resources in this topic. Moreover, we have\nperformed a set of first-of-their-kind experiments which show an improvement\nwhen integrating audio features into the argument mining pipeline. 
The provided\nresults can be used as a baseline for future research.\n","authors":["Ramon Ruiz-Dolz","Javier Iranzo-Sánchez"],"pdf_url":"https://arxiv.org/pdf/2302.12584v2.pdf","comment":"5 pages; EMNLP 2023 Accepted Version"},{"id":"http://arxiv.org/abs/2203.14647v2","updated":"2024-01-21T14:39:30Z","published":"2022-03-28T11:09:07Z","title":"Automatic Debate Evaluation with Argumentation Semantics and Natural\n Language Argument Graph Networks","summary":" The lack of annotated data on professional argumentation and complete\nargumentative debates has led to the oversimplification and the inability of\napproaching more complex natural language processing tasks. Such is the case of\nthe automatic debate evaluation. In this paper, we propose an original hybrid\nmethod to automatically evaluate argumentative debates. For that purpose, we\ncombine concepts from argumentation theory such as argumentation frameworks and\nsemantics, with Transformer-based architectures and neural graph networks.\nFurthermore, we obtain promising results that lay the basis on an unexplored\nnew instance of the automatic analysis of natural language arguments.\n","authors":["Ramon Ruiz-Dolz","Stella Heras","Ana García-Fornes"],"pdf_url":"https://arxiv.org/pdf/2203.14647v2.pdf","comment":"EMNLP 2023 Accepted Version"},{"id":"http://arxiv.org/abs/2401.11505v1","updated":"2024-01-21T14:30:20Z","published":"2024-01-21T14:30:20Z","title":"CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray\n Report Labeling","summary":" Free-text radiology reports present a rich data source for various medical\ntasks, but effectively labeling these texts remains challenging. Traditional\nrule-based labeling methods fall short of capturing the nuances of diverse\nfree-text patterns. Moreover, models using expert-annotated data are limited by\ndata scarcity and pre-defined classes, impacting their performance, flexibility\nand scalability. To address these issues, our study offers three main\ncontributions: 1) We demonstrate the potential of GPT as an adept labeler using\ncarefully designed prompts. 2) Utilizing only the data labeled by GPT, we\ntrained a BERT-based labeler, CheX-GPT, which operates faster and more\nefficiently than its GPT counterpart. 3) To benchmark labeler performance, we\nintroduced a publicly available expert-annotated test set, MIMIC-500,\ncomprising 500 cases from the MIMIC validation set. Our findings demonstrate\nthat CheX-GPT not only excels in labeling accuracy over existing models, but\nalso showcases superior efficiency, flexibility, and scalability, supported by\nour introduction of the MIMIC-500 dataset for robust benchmarking. Code and\nmodels are available at https://github.com/kakaobrain/CheXGPT.\n","authors":["Jawook Gu","Han-Cheol Cho","Jiho Kim","Kihyun You","Eun Kyoung Hong","Byungseok Roh"],"pdf_url":"https://arxiv.org/pdf/2401.11505v1.pdf","comment":"16 pages, 3 figures"},{"id":"http://arxiv.org/abs/2401.11504v1","updated":"2024-01-21T14:28:41Z","published":"2024-01-21T14:28:41Z","title":"With Greater Text Comes Greater Necessity: Inference-Time Training Helps\n Long Text Generation","summary":" Long text generation, such as novel writing or discourse-level translation\nwith extremely long contexts, presents significant challenges to current\nlanguage models. Existing methods mainly focus on extending the model's context\nwindow through strategies like length extrapolation. 
However, these approaches\ndemand substantial hardware resources during the training and/or inference\nphases. Our proposed method, Temp-Lora, introduces an alternative concept.\nInstead of relying on the KV cache to store all context information, Temp-Lora\nembeds this information directly into the model's parameters. In the process of\nlong text generation, we use a temporary Lora module, progressively trained\nwith text generated previously. This approach not only efficiently preserves\ncontextual knowledge but also prevents any permanent alteration to the model's\nparameters given that the module is discarded post-generation. Extensive\nexperiments on the PG19 language modeling benchmark and the GuoFeng\ndiscourse-level translation benchmark validate the effectiveness of Temp-Lora.\nOur results show that: 1) Temp-Lora substantially enhances generation quality\nfor long texts, as indicated by a 13.2% decrease in perplexity on a subset of\nPG19, and a 29.6% decrease in perplexity along with a 53.2% increase in BLEU\nscore on GuoFeng, 2) Temp-Lora is compatible with and enhances most existing\nlong text generation methods, and 3) Temp-Lora can greatly reduce computational\ncosts by shortening the context window. While ensuring a slight improvement in\ngeneration quality (a decrease of 3.8% in PPL), it enables a reduction of 70.5%\nin the FLOPs required for inference and a 51.5% decrease in latency.\n","authors":["Y. Wang","D. Ma","D. Cai"],"pdf_url":"https://arxiv.org/pdf/2401.11504v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05173v4","updated":"2024-01-21T13:38:20Z","published":"2023-09-11T00:02:05Z","title":"DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning","summary":" Prompt tuning (PT), where a small amount of trainable soft (continuous)\nprompt vectors is affixed to the input of language models (LM), has shown\npromising results across various tasks and models for parameter-efficient\nfine-tuning (PEFT). PT stands out from other PEFT approaches because it\nmaintains competitive performance with fewer trainable parameters and does not\ndrastically scale up its parameters as the model size expands. However, PT\nintroduces additional soft prompt tokens, leading to longer input sequences,\nwhich significantly impacts training and inference time and memory usage due to\nthe Transformer's quadratic complexity. Particularly concerning for Large\nLanguage Models (LLMs) that face heavy daily querying. To address this issue,\nwe propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt\ninto a shorter soft prompt and a pair of low-rank matrices that are then\noptimised with two different learning rates. This allows DePT to achieve better\nperformance while saving substantial memory and time costs compared to vanilla\nPT and its variants, without changing trainable parameter sizes. Through\nextensive experiments on 23 natural language processing (NLP) and\nvision-language (VL) tasks, we demonstrate that DePT outperforms\nstate-of-the-art PEFT approaches, including the full fine-tuning baseline, in\nsome scenarios. Additionally, we empirically show that DEPT grows more\nefficient as the model size increases. Our further study reveals that DePT\nintegrates seamlessly with parameter-efficient transfer learning in the\nfew-shot learning setting and highlights its adaptability to various model\narchitectures and sizes.\n","authors":["Zhengxiang Shi","Aldo Lipani"],"pdf_url":"https://arxiv.org/pdf/2309.05173v4.pdf","comment":"ICLR 2024. 
Code is available at https://github.com/ZhengxiangShi/DePT"},{"id":"http://arxiv.org/abs/2401.11487v1","updated":"2024-01-21T13:18:20Z","published":"2024-01-21T13:18:20Z","title":"Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties","summary":" The prevalence of social media presents a growing opportunity to collect and\nanalyse examples of English varieties. Whilst usage of these varieties was -\nand, in many cases, still is - used only in spoken contexts or hard-to-access\nprivate messages, social media sites like Twitter provide a platform for users\nto communicate informally in a scrapeable format. Notably, Indian English\n(Hinglish), Singaporean English (Singlish), and African-American English (AAE)\ncan be commonly found online. These varieties pose a challenge to existing\nnatural language processing (NLP) tools as they often differ orthographically\nand syntactically from standard English for which the majority of these tools\nare built. NLP models trained on standard English texts produced biased\noutcomes for users of underrepresented varieties. Some research has aimed to\novercome the inherent biases caused by unrepresentative data through techniques\nlike data augmentation or adjusting training models.\n We aim to address the issue of bias at its root - the data itself. We curate\na dataset of tweets from countries with high proportions of underserved English\nvariety speakers, and propose an annotation framework of six categorical\nclassifications along a pseudo-spectrum that measures the degree of standard\nEnglish and that thereby indirectly aims to surface the manifestations of\nEnglish varieties in these tweets. Following best annotation practices, our\ngrowing corpus features 170,800 tweets taken from 7 countries, labeled by\nannotators who are from those countries and can communicate in\nregionally-dominant varieties of English. Our corpus highlights the accuracy\ndiscrepancies in pre-trained language identifiers between western English and\nnon-western (i.e., less standard) English varieties. We hope to contribute to\nthe growing literature identifying and reducing the implicit demographic\ndiscrepancies in NLP.\n","authors":["Nhi Pham","Lachlan Pham","Adam L. Meyers"],"pdf_url":"https://arxiv.org/pdf/2401.11487v1.pdf","comment":"10 pages (including limitations, references and appendices), 2\n figures"},{"id":"http://arxiv.org/abs/2310.15823v3","updated":"2024-01-21T12:40:48Z","published":"2023-10-24T13:23:57Z","title":"Rosetta Stone at KSAA-RD Shared Task: A Hop From Language Modeling To\n Word--Definition Alignment","summary":" A Reverse Dictionary is a tool enabling users to discover a word based on its\nprovided definition, meaning, or description. Such a technique proves valuable\nin various scenarios, aiding language learners who possess a description of a\nword without its identity, and benefiting writers seeking precise terminology.\nThese scenarios often encapsulate what is referred to as the\n\"Tip-of-the-Tongue\" (TOT) phenomena. In this work, we present our winning\nsolution for the Arabic Reverse Dictionary shared task. This task focuses on\nderiving a vector representation of an Arabic word from its accompanying\ndescription. The shared task encompasses two distinct subtasks: the first\ninvolves an Arabic definition as input, while the second employs an English\ndefinition. For the first subtask, our approach relies on an ensemble of\nfinetuned Arabic BERT-based models, predicting the word embedding for a given\ndefinition. 
The final representation is obtained through averaging the output\nembeddings from each model within the ensemble. In contrast, the most effective\nsolution for the second subtask involves translating the English test\ndefinitions into Arabic and applying them to the finetuned models originally\ntrained for the first subtask. This straightforward method achieves the highest\nscore across both subtasks.\n","authors":["Ahmed ElBakry","Mohamed Gabr","Muhammad ElNokrashy","Badr AlKhamissi"],"pdf_url":"https://arxiv.org/pdf/2310.15823v3.pdf","comment":"Proceedings of ArabicNLP 2023"},{"id":"http://arxiv.org/abs/2401.11467v1","updated":"2024-01-21T11:42:18Z","published":"2024-01-21T11:42:18Z","title":"Over-Reasoning and Redundant Calculation of Large Language Models","summary":" Large language models (LLMs) can solve problems step-by-step. While this\nchain-of-thought (CoT) reasoning boosts LLMs' performance, it is unclear if\nLLMs \\textit{know} when to use CoT and whether those CoT are always necessary\nto answer the question. This paper shows that LLMs tend to generate redundant\ncalculations and reasoning on a manually constructed math QA dataset,\nGSM8K-Zero. GSM8K-Zero is constructed such that the questions can be answered\nwithout any calculations, but LLMs, including Llama-2 models and Claude-2, tend\nto generate lengthy and unnecessary calculations to answer the questions. We\nalso conduct experiments to explain why LLMs generate redundant calculations\nand reasonings. GSM8K-Zero is publicly available at\nhttps://github.com/d223302/Over-Reasoning-of-LLMs and\nhttps://huggingface.co/datasets/dcml0714/GSM8K-Zero.\n","authors":["Cheng-Han Chiang","Hung-yi Lee"],"pdf_url":"https://arxiv.org/pdf/2401.11467v1.pdf","comment":"EACL 2024 main conference paper. Camera-ready version"},{"id":"http://arxiv.org/abs/2401.11463v1","updated":"2024-01-21T11:04:30Z","published":"2024-01-21T11:04:30Z","title":"Estimating the Usefulness of Clarifying Questions and Answers for\n Conversational Search","summary":" While the body of research directed towards constructing and generating\nclarifying questions in mixed-initiative conversational search systems is vast,\nresearch aimed at processing and comprehending users' answers to such questions\nis scarce. To this end, we present a simple yet effective method for processing\nanswers to clarifying questions, moving away from previous work that simply\nappends answers to the original query and thus potentially degrades retrieval\nperformance. Specifically, we propose a classifier for assessing usefulness of\nthe prompted clarifying question and an answer given by the user. Useful\nquestions or answers are further appended to the conversation history and\npassed to a transformer-based query rewriting module. Results demonstrate\nsignificant improvements over strong non-mixed-initiative baselines.\nFurthermore, the proposed approach mitigates the performance drops when non\nuseful questions and answers are utilized.\n","authors":["Ivan Sekulić","Weronika Łajewska","Krisztian Balog","Fabio Crestani"],"pdf_url":"https://arxiv.org/pdf/2401.11463v1.pdf","comment":"This is the author's version of the work. 
The definitive version is\n published in: Proceedings of the 46th European Conference on Information\n Retrieval (ECIR '24), March 24-28, 2024, Glasgow, Scotland"},{"id":"http://arxiv.org/abs/2401.11458v1","updated":"2024-01-21T10:46:23Z","published":"2024-01-21T10:46:23Z","title":"Linear Alignment: A Closed-form Solution for Aligning Human Preferences\n without Tuning and Feedback","summary":" The success of AI assistants based on Language Models (LLMs) hinges on\nReinforcement Learning from Human Feedback (RLHF) to comprehend and align with\nuser intentions. However, traditional alignment algorithms, such as PPO, are\nhampered by complex annotation and training requirements. This reliance limits\nthe applicability of RLHF and hinders the development of professional\nassistants tailored to diverse human preferences. In this work, we introduce\n\\textit{Linear Alignment}, a novel algorithm that aligns language models with\nhuman preferences in one single inference step, eliminating the reliance on\ndata annotation and model training. Linear alignment incorporates a new\nparameterization for policy optimization under divergence constraints, which\nenables the extraction of optimal policy in a closed-form manner and\nfacilitates the direct estimation of the aligned response. Extensive\nexperiments on both general and personalized preference datasets demonstrate\nthat linear alignment significantly enhances the performance and efficiency of\nLLM alignment across diverse scenarios. Our code and dataset will be published\non \\url{https://github.com/Wizardcoast/Linear_Alignment.git}.\n","authors":["Songyang Gao","Qiming Ge","Wei Shen","Shihan Dou","Junjie Ye","Xiao Wang","Rui Zheng","Yicheng Zou","Zhi Chen","Hang Yan","Qi Zhang","Dahua Lin"],"pdf_url":"https://arxiv.org/pdf/2401.11458v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11452v1","updated":"2024-01-21T10:15:36Z","published":"2024-01-21T10:15:36Z","title":"Towards Reliable and Factual Response Generation: Detecting Unanswerable\n Questions in Information-Seeking Conversations","summary":" Generative AI models face the challenge of hallucinations that can undermine\nusers' trust in such systems. We approach the problem of conversational\ninformation seeking as a two-step process, where relevant passages in a corpus\nare identified first and then summarized into a final system response. This way\nwe can automatically assess if the answer to the user's question is present in\nthe corpus. Specifically, our proposed method employs a sentence-level\nclassifier to detect if the answer is present, then aggregates these\npredictions on the passage level, and eventually across the top-ranked passages\nto arrive at a final answerability estimate. For training and evaluation, we\ndevelop a dataset based on the TREC CAsT benchmark that includes answerability\nlabels on the sentence, passage, and ranking levels. We demonstrate that our\nproposed method represents a strong baseline and outperforms a state-of-the-art\nLLM on the answerability prediction task.\n","authors":["Weronika Łajewska","Krisztian Balog"],"pdf_url":"https://arxiv.org/pdf/2401.11452v1.pdf","comment":"This is the author's version of the work. 
The definitive version is\n published in: Proceedings of the 46th European Conference on Information\n Retrieval} (ECIR '24), March 24--28, 2024, Glasgow, Scotland"},{"id":"http://arxiv.org/abs/2312.11532v2","updated":"2024-01-21T09:30:36Z","published":"2023-12-15T15:01:10Z","title":"Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided\n Document Generation","summary":" This paper introduces a novel approach for topic modeling utilizing latent\ncodebooks from Vector-Quantized Variational Auto-Encoder~(VQ-VAE), discretely\nencapsulating the rich information of the pre-trained embeddings such as the\npre-trained language model. From the novel interpretation of the latent\ncodebooks and embeddings as conceptual bag-of-words, we propose a new\ngenerative topic model called Topic-VQ-VAE~(TVQ-VAE) which inversely generates\nthe original documents related to the respective latent codebook. The TVQ-VAE\ncan visualize the topics with various generative distributions including the\ntraditional BoW distribution and the autoregressive image generation. Our\nexperimental results on document analysis and image generation demonstrate that\nTVQ-VAE effectively captures the topic context which reveals the underlying\nstructures of the dataset and supports flexible forms of document generation.\nOfficial implementation of the proposed TVQ-VAE is available at\nhttps://github.com/clovaai/TVQ-VAE.\n","authors":["YoungJoon Yoo","Jongwon Choi"],"pdf_url":"https://arxiv.org/pdf/2312.11532v2.pdf","comment":"Published in the 38th annual AAAI conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2401.11431v1","updated":"2024-01-21T08:43:24Z","published":"2024-01-21T08:43:24Z","title":"Majority or Minority: Data Imbalance Learning Method for Named Entity\n Recognition","summary":" Data imbalance presents a significant challenge in various machine learning\n(ML) tasks, particularly named entity recognition (NER) within natural language\nprocessing (NLP). NER exhibits a data imbalance with a long-tail distribution,\nfeaturing numerous minority classes (i.e., entity classes) and a single\nmajority class (i.e., O-class). The imbalance leads to the misclassifications\nof the entity classes as the O-class. To tackle the imbalance, we propose a\nsimple and effective learning method, named majority or minority (MoM)\nlearning. MoM learning incorporates the loss computed only for samples whose\nground truth is the majority class (i.e., the O-class) into the loss of the\nconventional ML model. Evaluation experiments on four NER datasets (Japanese\nand English) showed that MoM learning improves prediction performance of the\nminority classes, without sacrificing the performance of the majority class and\nis more effective than widely known and state-of-the-art methods. We also\nevaluated MoM learning using frameworks as sequential labeling and machine\nreading comprehension, which are commonly used in NER. 
Furthermore, MoM\nlearning has achieved consistent performance improvements regardless of\nlanguage, model, or framework.\n","authors":["Sota Nemoto","Shunsuke Kitada","Hitoshi Iyatomi"],"pdf_url":"https://arxiv.org/pdf/2401.11431v1.pdf","comment":"6 pages, 1 figures, 6 tables"},{"id":"http://arxiv.org/abs/2302.06419v2","updated":"2024-01-21T07:41:02Z","published":"2023-02-10T02:55:52Z","title":"AV-data2vec: Self-supervised Learning of Audio-Visual Speech\n Representations with Contextualized Target Representations","summary":" Self-supervision has shown great potential for audio-visual speech\nrecognition by vastly reducing the amount of labeled data required to build\ngood systems. However, existing methods are either not entirely end-to-end or\ndo not train joint representations of both modalities. In this paper, we\nintroduce AV-data2vec which addresses these challenges and builds audio-visual\nrepresentations based on predicting contextualized representations which has\nbeen successful in the uni-modal case. The model uses a shared transformer\nencoder for both audio and video and can combine both modalities to improve\nspeech recognition. Results on LRS3 show that AV-data2vec consistently\noutperforms existing methods under all settings with the same amount of data\nand model size.\n","authors":["Jiachen Lian","Alexei Baevski","Wei-Ning Hsu","Michael Auli"],"pdf_url":"https://arxiv.org/pdf/2302.06419v2.pdf","comment":"2023 ASRU"},{"id":"http://arxiv.org/abs/2401.10015v2","updated":"2024-01-21T06:51:25Z","published":"2024-01-18T14:33:01Z","title":"Towards Hierarchical Spoken Language Dysfluency Modeling","summary":" Speech disfluency modeling is the bottleneck for both speech therapy and\nlanguage learning. However, there is no effective AI solution to systematically\ntackle this problem. We solidify the concept of disfluent speech and disfluent\nspeech modeling. We then present Hierarchical Unconstrained Disfluency Modeling\n(H-UDM) approach, the hierarchical extension of UDM that addresses both\ndisfluency transcription and detection to eliminate the need for extensive\nmanual annotation. Our experimental findings serve as clear evidence of the\neffectiveness and reliability of the methods we have introduced, encompassing\nboth transcription and detection tasks.\n","authors":["Jiachen Lian","Gopala Anumanchipalli"],"pdf_url":"https://arxiv.org/pdf/2401.10015v2.pdf","comment":"2024 EACL. Hierarchical extension of our previous workshop paper\n arXiv:2312.12810"},{"id":"http://arxiv.org/abs/2401.11408v1","updated":"2024-01-21T06:10:03Z","published":"2024-01-21T06:10:03Z","title":"SEBERTNets: Sequence Enhanced BERT Networks for Event Entity Extraction\n Tasks Oriented to the Finance Field","summary":" Event extraction lies at the cores of investment analysis and asset\nmanagement in the financial field, and thus has received much attention. The\n2019 China conference on knowledge graph and semantic computing (CCKS)\nchallenge sets up a evaluation competition for event entity extraction task\noriented to the finance field. In this task, we mainly focus on how to extract\nthe event entity accurately, and recall all the corresponding event entity\neffectively. In this paper, we propose a novel model, Sequence Enhanced BERT\nNetworks (SEBERTNets for short), which can inherit the advantages of the\nBERT,and while capturing sequence semantic information. 
In addition, motivated\nby recommendation system, we propose Hybrid Sequence Enhanced BERT Networks\n(HSEBERTNets for short), which uses a multi-channel recall method to recall all\nthe corresponding event entity. The experimental results show that, the F1\nscore of SEBERTNets is 0.905 in the first stage, and the F1 score of\nHSEBERTNets is 0.934 in the first stage, which demonstarate the effectiveness\nof our methods.\n","authors":["Congqing He","Xiangyu Zhu","Yuquan Le","Yuzhong Liu","Jianhong Yin"],"pdf_url":"https://arxiv.org/pdf/2401.11408v1.pdf","comment":"CCKS 2019"},{"id":"http://arxiv.org/abs/2312.07930v2","updated":"2024-01-21T05:22:22Z","published":"2023-12-13T06:57:00Z","title":"Towards Optimal Statistical Watermarking","summary":" We study statistical watermarking by formulating it as a hypothesis testing\nproblem, a general framework which subsumes all previous statistical\nwatermarking methods. Key to our formulation is a coupling of the output tokens\nand the rejection region, realized by pseudo-random generators in practice,\nthat allows non-trivial trade-off between the Type I error and Type II error.\nWe characterize the Uniformly Most Powerful (UMP) watermark in the general\nhypothesis testing setting and the minimax Type II error in the model-agnostic\nsetting. In the common scenario where the output is a sequence of $n$ tokens,\nwe establish nearly matching upper and lower bounds on the number of i.i.d.\ntokens required to guarantee small Type I and Type II errors. Our rate of\n$\\Theta(h^{-1} \\log (1/h))$ with respect to the average entropy per token $h$\nhighlights potentials for improvement from the rate of $h^{-2}$ in the previous\nworks. Moreover, we formulate the robust watermarking problem where users are\nallowed to perform a class of perturbations on the generated texts, and\ncharacterize the optimal type II error of robust UMP tests via a linear\nprogramming problem. To the best of our knowledge, this is the first systematic\nstatistical treatment on the watermarking problem with near-optimal rates in\nthe i.i.d. setting, which might be of interest for future works.\n","authors":["Baihe Huang","Banghua Zhu","Hanlin Zhu","Jason D. Lee","Jiantao Jiao","Michael I. Jordan"],"pdf_url":"https://arxiv.org/pdf/2312.07930v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11403v1","updated":"2024-01-21T04:54:45Z","published":"2024-01-21T04:54:45Z","title":"MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks\n via Text Prompts","summary":" Deep learning is now widely used in drug discovery, providing significant\nacceleration and cost reduction. As the most fundamental building block,\nmolecular representation is essential for predicting molecular properties to\nenable various downstream applications. Most existing methods attempt to\nincorporate more information to learn better representations. However, not all\nfeatures are equally important for a specific task. Ignoring this would\npotentially compromise the training efficiency and predictive accuracy. To\naddress this issue, we propose a novel approach, which treats language models\nas an agent and molecular pretraining models as a knowledge base. The agent\naccentuates task-relevant features in the molecular representation by\nunderstanding the natural language description of the task, just as a tailor\ncustomizes clothes for clients. 
Thus, we call this approach MolTailor.\nEvaluations demonstrate MolTailor's superior performance over baselines,\nvalidating the efficacy of enhancing relevance for molecular representation\nlearning. This illustrates the potential of language model guided optimization\nto better exploit and unleash the capabilities of existing powerful molecular\nrepresentation methods. Our codes and appendix are available at\nhttps://github.com/SCIR-HI/MolTailor.\n","authors":["Haoqiang Guo","Sendong Zhao","Haochun Wang","Yanrui Du","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2401.11403v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2310.02255v3","updated":"2024-01-21T03:47:06Z","published":"2023-10-03T17:57:24Z","title":"MathVista: Evaluating Mathematical Reasoning of Foundation Models in\n Visual Contexts","summary":" Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit\nimpressive problem-solving skills in many tasks and domains, but their ability\nin mathematical reasoning in visual contexts has not been systematically\nstudied. To bridge this gap, we present MathVista, a benchmark designed to\ncombine challenges from diverse mathematical and visual tasks. It consists of\n6,141 examples, derived from 28 existing multimodal datasets involving\nmathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and\nPaperQA). Completing these tasks requires fine-grained, deep visual\nunderstanding and compositional reasoning, which all state-of-the-art\nfoundation models find challenging. With MathVista, we have conducted a\ncomprehensive, quantitative evaluation of 12 prominent foundation models. The\nbest-performing GPT-4V model achieves an overall accuracy of 49.9%,\nsubstantially outperforming Bard, the second-best performer, by 15.1%. Our\nin-depth analysis reveals that the superiority of GPT-4V is mainly attributed\nto its enhanced visual perception and mathematical reasoning. However, GPT-4V\nstill falls short of human performance by 10.4%, as it often struggles to\nunderstand complex figures and perform rigorous reasoning. This significant gap\nunderscores the critical role that MathVista will play in the development of\ngeneral-purpose AI agents capable of tackling mathematically intensive and\nvisually rich real-world tasks. We further explore the new ability of\nself-verification, the application of self-consistency, and the interactive\nchatbot capabilities of GPT-4V, highlighting its promising potential for future\nresearch. The project is available at https://mathvista.github.io/.\n","authors":["Pan Lu","Hritik Bansal","Tony Xia","Jiacheng Liu","Chunyuan Li","Hannaneh Hajishirzi","Hao Cheng","Kai-Wei Chang","Michel Galley","Jianfeng Gao"],"pdf_url":"https://arxiv.org/pdf/2310.02255v3.pdf","comment":"116 pages, 120 figures. Accepted to ICLR 2024"},{"id":"http://arxiv.org/abs/2401.11389v1","updated":"2024-01-21T03:37:47Z","published":"2024-01-21T03:37:47Z","title":"MedLM: Exploring Language Models for Medical Question Answering Systems","summary":" In the face of rapidly expanding online medical literature, automated systems\nfor aggregating and summarizing information are becoming increasingly crucial\nfor healthcare professionals and patients. Large Language Models (LLMs), with\ntheir advanced generative capabilities, have shown promise in various NLP\ntasks, and their potential in the healthcare domain, particularly for\nClosed-Book Generative QnA, is significant. 
However, the performance of these\nmodels in domain-specific tasks such as medical Q&A remains largely unexplored.\nThis study aims to fill this gap by comparing the performance of general and\nmedical-specific distilled LMs for medical Q&A. We aim to evaluate the\neffectiveness of fine-tuning domain-specific LMs and compare the performance of\ndifferent families of Language Models. The study will address critical\nquestions about these models' reliability, comparative performance, and\neffectiveness in the context of medical Q&A. The findings will provide valuable\ninsights into the suitability of different LMs for specific applications in the\nmedical domain.\n","authors":["Niraj Yagnik","Jay Jhaveri","Vivek Sharma","Gabriel Pila","Asma Ben","Jingbo Shang"],"pdf_url":"https://arxiv.org/pdf/2401.11389v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10189v2","updated":"2024-01-21T03:37:41Z","published":"2024-01-18T18:20:15Z","title":"Chem-FINESE: Validating Fine-Grained Few-shot Entity Extraction through\n Text Reconstruction","summary":" Fine-grained few-shot entity extraction in the chemical domain faces two\nunique challenges. First, compared with entity extraction tasks in the general\ndomain, sentences from chemical papers usually contain more entities. Moreover,\nentity extraction models usually have difficulty extracting entities of\nlong-tailed types. In this paper, we propose Chem-FINESE, a novel\nsequence-to-sequence (seq2seq) based few-shot entity extraction approach, to\naddress these two challenges. Our Chem-FINESE has two components: a seq2seq\nentity extractor to extract named entities from the input sentence and a\nseq2seq self-validation module to reconstruct the original input sentence from\nextracted entities. Inspired by the fact that a good entity extraction system\nneeds to extract entities faithfully, our new self-validation module leverages\nentity extraction results to reconstruct the original input sentence. Besides,\nwe design a new contrastive loss to reduce excessive copying during the\nextraction process. Finally, we release ChemNER+, a new fine-grained chemical\nentity extraction dataset that is annotated by domain experts with the ChemNER\nschema. Experiments in few-shot settings with both ChemNER+ and CHEMET datasets\nshow that our newly proposed framework has contributed up to 8.26% and 6.84%\nabsolute F1-score gains respectively.\n","authors":["Qingyun Wang","Zixuan Zhang","Hongxiang Li","Xuan Liu","Jiawei Han","Heng Ji","Huimin Zhao"],"pdf_url":"https://arxiv.org/pdf/2401.10189v2.pdf","comment":"16 pages. Accepted by Findings of the Association for Computational\n Linguistics: EACL 2024. Code and resources are available at\n https://github.com/EagleW/Chem-FINESE"},{"id":"http://arxiv.org/abs/2401.11382v1","updated":"2024-01-21T03:15:05Z","published":"2024-01-21T03:15:05Z","title":"Using Large Language Model for End-to-End Chinese ASR and NER","summary":" Mapping speech tokens to the same feature space as text tokens has become the\nparadigm for the integration of speech modality into decoder-only large\nlanguage models (LLMs). An alternative approach is to use an encoder-decoder\narchitecture that incorporates speech features through cross-attention. This\napproach, however, has received less attention in the literature. In this work,\nwe connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons\nof these two approaches using Chinese automatic speech recognition (ASR) and\nname entity recognition (NER) tasks. 
We evaluate them not only by conventional\nmetrics like the F1 score but also by a novel fine-grained taxonomy of ASR-NER\nerrors. Our experiments reveal that encoder-decoder architecture outperforms\ndecoder-only architecture with a short context, while decoder-only architecture\nbenefits from a long context as it fully exploits all layers of the LLM. By\nusing LLM, we significantly reduced the entity omission errors and improved the\nentity ASR accuracy compared to the Conformer baseline. Additionally, we\nobtained a state-of-the-art (SOTA) F1 score of 0.805 on the AISHELL-NER test\nset by using chain-of-thought (CoT) NER which first infers long-form ASR\ntranscriptions and then predicts NER labels.\n","authors":["Yuang Li","Jiawei Yu","Yanqing Zhao","Min Zhang","Mengxin Ren","Xiaofeng Zhao","Xiaosong Qiao","Chang Su","Miaomiao Ma","Hao Yang"],"pdf_url":"https://arxiv.org/pdf/2401.11382v1.pdf","comment":"5 pages, 2 figures"},{"id":"http://arxiv.org/abs/2401.11374v1","updated":"2024-01-21T02:29:12Z","published":"2024-01-21T02:29:12Z","title":"Language Models as Hierarchy Encoders","summary":" Interpreting hierarchical structures latent in language is a key limitation\nof current language models (LMs). While previous research has implicitly\nleveraged these hierarchies to enhance LMs, approaches for their explicit\nencoding are yet to be explored. To address this, we introduce a novel approach\nto re-train transformer encoder-based LMs as Hierarchy Transformer encoders\n(HiTs), harnessing the expansive nature of hyperbolic space. Our method\nsituates the output embedding space of pre-trained LMs within a Poincar\\'e ball\nwith a curvature that adapts to the embedding dimension, followed by\nre-training on hyperbolic cluster and centripetal losses. These losses are\ndesigned to effectively cluster related entities (input as texts) and organise\nthem hierarchically. We evaluate HiTs against pre-trained and fine-tuned LMs,\nfocusing on their capabilities in simulating transitive inference, predicting\nsubsumptions, and transferring knowledge across hierarchies. The results\ndemonstrate that HiTs consistently outperform both pre-trained and fine-tuned\nLMs in these tasks, underscoring the effectiveness and transferability of our\nre-trained hierarchy encoders.\n","authors":["Yuan He","Zhangdie Yuan","Jiaoyan Chen","Ian Horrocks"],"pdf_url":"https://arxiv.org/pdf/2401.11374v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11373v1","updated":"2024-01-21T02:25:29Z","published":"2024-01-21T02:25:29Z","title":"Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing\n Approach For Uncovering Edge Cases with Minimal Distribution Distortion","summary":" Adversarial attacks against NLP Deep Learning models are a significant\nconcern. In particular, adversarial samples exploit the model's sensitivity to\nsmall input changes. While these changes appear insignificant on the semantics\nof the input sample, they result in significant decay in model performance. In\nthis paper, we propose Targeted Paraphrasing via RL (TPRL), an approach to\nautomatically learn a policy to generate challenging samples that most likely\nimprove the model's performance. TPRL leverages FLAN T5, a language model, as a\ngenerator and employs a self learned policy using a proximal policy gradient to\ngenerate the adversarial examples automatically. TPRL's reward is based on the\nconfusion induced in the classifier, preserving the original text meaning\nthrough a Mutual Implication score. 
We demonstrate and evaluate TPRL's\neffectiveness in discovering natural adversarial attacks and improving model\nperformance through extensive experiments on four diverse NLP classification\ntasks via Automatic and Human evaluation. TPRL outperforms strong baselines,\nexhibits generalizability across classifiers and datasets, and combines the\nstrengths of language modeling and reinforcement learning to generate diverse\nand influential adversarial examples.\n","authors":["Aly M. Kassem","Sherif Saad"],"pdf_url":"https://arxiv.org/pdf/2401.11373v1.pdf","comment":"EACL 2024 - Main conference"},{"id":"http://arxiv.org/abs/2401.11365v1","updated":"2024-01-21T01:37:25Z","published":"2024-01-21T01:37:25Z","title":"Confidence Preservation Property in Knowledge Distillation Abstractions","summary":" Social media platforms prevent malicious activities by detecting harmful\ncontent of posts and comments. To that end, they employ large-scale deep neural\nnetwork language models for sentiment analysis and content understanding. Some\nmodels, like BERT, are complex, and have numerous parameters, which makes them\nexpensive to operate and maintain. To overcome these deficiencies, industry\nexperts employ a knowledge distillation compression technique, where a\ndistilled model is trained to reproduce the classification behavior of the\noriginal model. The distillation processes terminates when the distillation\nloss function reaches the stopping criteria. This function is mainly designed\nto ensure that the original and the distilled models exhibit alike\nclassification behaviors. However, besides classification accuracy, there are\nadditional properties of the original model that the distilled model should\npreserve to be considered as an appropriate abstraction. In this work, we\nexplore whether distilled TinyBERT models preserve confidence values of the\noriginal BERT models, and investigate how this confidence preservation property\ncould guide tuning hyperparameters of the distillation process.\n","authors":["Dmitry Vengertsev","Elena Sherman"],"pdf_url":"https://arxiv.org/pdf/2401.11365v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11361v1","updated":"2024-01-21T01:18:08Z","published":"2024-01-21T01:18:08Z","title":"Revolutionizing API Documentation through Summarization","summary":" This study tackles the challenges associated with interpreting Application\nProgramming Interface (API) documentation, an integral aspect of software\ndevelopment. Official API documentation, while essential, can be lengthy and\nchallenging to navigate, prompting developers to seek unofficial sources such\nas Stack Overflow. Leveraging the vast user-generated content on Stack\nOverflow, including code snippets and discussions, we employ BERTopic and\nextractive summarization to automatically generate concise and informative API\nsummaries. These summaries encompass key insights like general usage, common\ndeveloper issues, and potential solutions, sourced from the wealth of knowledge\non Stack Overflow. 
Software developers evaluate these summaries for\nperformance, coherence, and interoperability, providing valuable feedback on\nthe practicality of our approach.\n","authors":["AmirHossein Naghshzan","Sylvie Ratte"],"pdf_url":"https://arxiv.org/pdf/2401.11361v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2308.09070"},{"id":"http://arxiv.org/abs/2401.11356v1","updated":"2024-01-21T00:58:31Z","published":"2024-01-21T00:58:31Z","title":"ProLex: A Benchmark for Language Proficiency-oriented Lexical\n Substitution","summary":" Lexical Substitution discovers appropriate substitutes for a given target\nword in a context sentence. However, the task fails to consider substitutes\nthat are of equal or higher proficiency than the target, an aspect that could\nbe beneficial for language learners looking to improve their writing. To bridge\nthis gap, we propose a new task, language proficiency-oriented lexical\nsubstitution. We also introduce ProLex, a novel benchmark designed to assess\nsystems' ability to generate not only appropriate substitutes but also\nsubstitutes that demonstrate better language proficiency. Besides the\nbenchmark, we propose models that can automatically perform the new task. We\nshow that our best model, a Llama2-13B model fine-tuned with task-specific\nsynthetic data, outperforms ChatGPT by an average of 3.2% in F-score and\nachieves comparable results with GPT-4 on ProLex.\n","authors":["Xuanming Zhang","Zixun Chen","Zhou Yu"],"pdf_url":"https://arxiv.org/pdf/2401.11356v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2401.11631v1","updated":"2024-01-21T23:54:05Z","published":"2024-01-21T23:54:05Z","title":"Text-to-Image Cross-Modal Generation: A Systematic Review","summary":" We review research on generating visual data from text from the angle of\n\"cross-modal generation.\" This point of view allows us to draw parallels\nbetween various methods geared towards working on input text and producing\nvisual output, without limiting the analysis to narrow sub-areas. It also\nresults in the identification of common templates in the field, which are then\ncompared and contrasted both within pools of similar methods and across lines\nof research. We provide a breakdown of text-to-image generation into various\nflavors of image-from-text methods, video-from-text methods, image editing,\nself-supervised and graph-based approaches. In this discussion, we focus on\nresearch papers published at 8 leading machine learning conferences in the\nyears 2016-2022, also incorporating a number of relevant papers not matching\nthe outlined search criteria. The conducted review suggests a significant\nincrease in the number of papers published in the area and highlights research\ngaps and potential lines of investigation. To our knowledge, this is the first\nreview to systematically look at text-to-image generation from the perspective\nof \"cross-modal generation.\"\n","authors":["Maciej Żelaszczyk","Jacek Mańdziuk"],"pdf_url":"https://arxiv.org/pdf/2401.11631v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.05105v2","updated":"2024-01-21T23:04:32Z","published":"2023-03-09T08:24:02Z","title":"MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model\n for Few-Shot Instance Segmentation","summary":" Few-shot instance segmentation extends the few-shot learning paradigm to the\ninstance segmentation task, which tries to segment instance objects from a\nquery image with a few annotated examples of novel categories. 
Conventional\napproaches have attempted to address the task via prototype learning, known as\npoint estimation. However, this mechanism depends on prototypes (\\eg mean of\n$K-$shot) for prediction, leading to performance instability. To overcome the\ndisadvantage of the point estimation mechanism, we propose a novel approach,\ndubbed MaskDiff, which models the underlying conditional distribution of a\nbinary mask, which is conditioned on an object region and $K-$shot information.\nInspired by augmentation approaches that perturb data with Gaussian noise for\npopulating low data density regions, we model the mask distribution with a\ndiffusion probabilistic model. We also propose to utilize classifier-free\nguided mask sampling to integrate category information into the binary mask\ngeneration process. Without bells and whistles, our proposed method\nconsistently outperforms state-of-the-art methods on both base and novel\nclasses of the COCO dataset while simultaneously being more stable than\nexisting methods. The source code is available at:\nhttps://github.com/minhquanlecs/MaskDiff.\n","authors":["Minh-Quan Le","Tam V. Nguyen","Trung-Nghia Le","Thanh-Toan Do","Minh N. Do","Minh-Triet Tran"],"pdf_url":"https://arxiv.org/pdf/2303.05105v2.pdf","comment":"Accepted at AAAI 2024 (oral presentation)"},{"id":"http://arxiv.org/abs/2401.11617v1","updated":"2024-01-21T22:50:44Z","published":"2024-01-21T22:50:44Z","title":"A Survey on African Computer Vision Datasets, Topics and Researchers","summary":" Computer vision encompasses a range of tasks such as object detection,\nsemantic segmentation, and 3D reconstruction. Despite its relevance to African\ncommunities, research in this field within Africa represents only 0.06% of\ntop-tier publications over the past decade. This study undertakes a thorough\nanalysis of 63,000 Scopus-indexed computer vision publications from Africa,\nspanning from 2012 to 2022. The aim is to provide a survey of African computer\nvision topics, datasets and researchers. A key aspect of our study is the\nidentification and categorization of African Computer Vision datasets using\nlarge language models that automatically parse abstracts of these publications.\nWe also provide a compilation of unofficial African Computer Vision datasets\ndistributed through challenges or data hosting platforms, and provide a full\ntaxonomy of dataset categories. Our survey also pinpoints computer vision\ntopics trends specific to different African regions, indicating their unique\nfocus areas. Additionally, we carried out an extensive survey to capture the\nviews of African researchers on the current state of computer vision research\nin the continent and the structural barriers they believe need urgent\nattention. In conclusion, this study catalogs and categorizes Computer Vision\ndatasets and topics contributed or initiated by African institutions and\nidentifies barriers to publishing in top-tier Computer Vision venues. This\nsurvey underscores the importance of encouraging African researchers and\ninstitutions in advancing computer vision research in the continent. It also\nstresses on the need for research topics to be more aligned with the needs of\nAfrican communities.\n","authors":["Abdul-Hakeem Omotayo","Ashery Mbilinyi","Lukman Ismaila","Houcemeddine Turki","Mahmoud Abdien","Karim Gamal","Idriss Tondji","Yvan Pimi","Naome A. Etori","Marwa M. 
Matar","Clifford Broni-Bediako","Abigail Oppong","Mai Gamal","Eman Ehab","Gbetondji Dovonon","Zainab Akinjobi","Daniel Ajisafe","Oluwabukola G. Adegboro","Mennatullah Siam"],"pdf_url":"https://arxiv.org/pdf/2401.11617v1.pdf","comment":"Under Review, Community Work of Ro'ya Grassroots,\n https://ro-ya-cv4africa.github.io/homepage/. arXiv admin note: text overlap\n with arXiv:2305.06773"},{"id":"http://arxiv.org/abs/2311.03500v2","updated":"2024-01-21T22:04:28Z","published":"2023-11-06T20:18:26Z","title":"Predicting Age from White Matter Diffusivity with Residual Learning","summary":" Imaging findings inconsistent with those expected at specific chronological\nage ranges may serve as early indicators of neurological disorders and\nincreased mortality risk. Estimation of chronological age, and deviations from\nexpected results, from structural MRI data has become an important task for\ndeveloping biomarkers that are sensitive to such deviations. Complementary to\nstructural analysis, diffusion tensor imaging (DTI) has proven effective in\nidentifying age-related microstructural changes within the brain white matter,\nthereby presenting itself as a promising additional modality for brain age\nprediction. Although early studies have sought to harness DTI's advantages for\nage estimation, there is no evidence that the success of this prediction is\nowed to the unique microstructural and diffusivity features that DTI provides,\nrather than the macrostructural features that are also available in DTI data.\nTherefore, we seek to develop white-matter-specific age estimation to capture\ndeviations from normal white matter aging. Specifically, we deliberately\ndisregard the macrostructural information when predicting age from DTI scalar\nimages, using two distinct methods. The first method relies on extracting only\nmicrostructural features from regions of interest. The second applies 3D\nresidual neural networks (ResNets) to learn features directly from the images,\nwhich are non-linearly registered and warped to a template to minimize\nmacrostructural variations. When tested on unseen data, the first method yields\nmean absolute error (MAE) of 6.11 years for cognitively normal participants and\nMAE of 6.62 years for cognitively impaired participants, while the second\nmethod achieves MAE of 4.69 years for cognitively normal participants and MAE\nof 4.96 years for cognitively impaired participants. We find that the ResNet\nmodel captures subtler, non-macrostructural features for brain age prediction.\n","authors":["Chenyu Gao","Michael E. Kim","Ho Hin Lee","Qi Yang","Nazirah Mohd Khairi","Praitayini Kanakaraj","Nancy R. Newlin","Derek B. Archer","Angela L. Jefferson","Warren D. Taylor","Brian D. Boyd","Lori L. Beason-Held","Susan M. Resnick","The BIOCARD Study Team","Yuankai Huo","Katherine D. Van Schaik","Kurt G. Schilling","Daniel Moyer","Ivana Išgum","Bennett A. Landman"],"pdf_url":"https://arxiv.org/pdf/2311.03500v2.pdf","comment":"SPIE Medical Imaging: Image Processing. San Diego, CA. February 2024\n (accepted as poster presentation)"},{"id":"http://arxiv.org/abs/2401.11605v1","updated":"2024-01-21T21:49:49Z","published":"2024-01-21T21:49:49Z","title":"Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass\n Diffusion Transformers","summary":" We present the Hourglass Diffusion Transformer (HDiT), an image generative\nmodel that exhibits linear scaling with pixel count, supporting training at\nhigh-resolution (e.g. $1024 \\times 1024$) directly in pixel-space. 
Building on\nthe Transformer architecture, which is known to scale to billions of\nparameters, it bridges the gap between the efficiency of convolutional U-Nets\nand the scalability of Transformers. HDiT trains successfully without typical\nhigh-resolution training techniques such as multiscale architectures, latent\nautoencoders or self-conditioning. We demonstrate that HDiT performs\ncompetitively with existing models on ImageNet $256^2$, and sets a new\nstate-of-the-art for diffusion models on FFHQ-$1024^2$.\n","authors":["Katherine Crowson","Stefan Andreas Baumann","Alex Birch","Tanishq Mathew Abraham","Daniel Z. Kaplan","Enrico Shippole"],"pdf_url":"https://arxiv.org/pdf/2401.11605v1.pdf","comment":"20 pages, 13 figures, project page and code available at\n https://crowsonkb.github.io/hourglass-diffusion-transformers/"},{"id":"http://arxiv.org/abs/2401.11598v1","updated":"2024-01-21T21:04:05Z","published":"2024-01-21T21:04:05Z","title":"TetraLoss: Improving the Robustness of Face Recognition against Morphing\n Attacks","summary":" Face recognition systems are widely deployed in high-security applications\nsuch as for biometric verification at border controls. Despite their high\naccuracy on pristine data, it is well-known that digital manipulations, such as\nface morphing, pose a security threat to face recognition systems. Malicious\nactors can exploit the facilities offered by the identity document issuance\nprocess to obtain identity documents containing morphed images. Thus, subjects\nwho contributed to the creation of the morphed image can with high probability\nuse the identity document to bypass automated face recognition systems. In\nrecent years, no-reference (i.e., single image) and differential morphing\nattack detectors have been proposed to tackle this risk. These systems are\ntypically evaluated in isolation from the face recognition system that they\nhave to operate jointly with and do not consider the face recognition process.\nContrary to most existing works, we present a novel method for adapting deep\nlearning-based face recognition systems to be more robust against face morphing\nattacks. To this end, we introduce TetraLoss, a novel loss function that learns\nto separate morphed face images from its contributing subjects in the embedding\nspace while still preserving high biometric verification performance. In a\ncomprehensive evaluation, we show that the proposed method can significantly\nenhance the original system while also significantly outperforming other tested\nbaseline methods.\n","authors":["Mathias Ibsen","Lázaro J. González-Soler","Christian Rathgeb","Christoph Busch"],"pdf_url":"https://arxiv.org/pdf/2401.11598v1.pdf","comment":"Accepted to the IEEE International Conference on Automatic Face &\n Gesture Recognition 2024 (FG'24)"},{"id":"http://arxiv.org/abs/2310.01361v2","updated":"2024-01-21T21:01:12Z","published":"2023-10-02T17:23:48Z","title":"GenSim: Generating Robotic Simulation Tasks via Large Language Models","summary":" Collecting large amounts of real-world interaction data to train general\nrobotic policies is often prohibitively expensive, thus motivating the use of\nsimulation data. However, existing methods for data generation have generally\nfocused on scene-level diversity (e.g., object instances and poses) rather than\ntask-level diversity, due to the human effort required to come up with and\nverify novel tasks. This has made it challenging for policies trained on\nsimulation data to demonstrate significant task-level generalization. 
In this\npaper, we propose to automatically generate rich simulation environments and\nexpert demonstrations by exploiting a large language models' (LLM) grounding\nand coding ability. Our approach, dubbed GenSim, has two modes: goal-directed\ngeneration, wherein a target task is given to the LLM and the LLM proposes a\ntask curriculum to solve the target task, and exploratory generation, wherein\nthe LLM bootstraps from previous tasks and iteratively proposes novel tasks\nthat would be helpful in solving more complex tasks. We use GPT4 to expand the\nexisting benchmark by ten times to over 100 tasks, on which we conduct\nsupervised finetuning and evaluate several LLMs including finetuned GPTs and\nCode Llama on code generation for robotic simulation tasks. Furthermore, we\nobserve that LLMs-generated simulation programs can enhance task-level\ngeneralization significantly when used for multitask policy training. We\nfurther find that with minimal sim-to-real adaptation, the multitask policies\npretrained on GPT4-generated simulation tasks exhibit stronger transfer to\nunseen long-horizon tasks in the real world and outperform baselines by 25%.\nSee the project website (https://liruiw.github.io/gensim) for code, demos, and\nvideos.\n","authors":["Lirui Wang","Yiyang Ling","Zhecheng Yuan","Mohit Shridhar","Chen Bao","Yuzhe Qin","Bailin Wang","Huazhe Xu","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.01361v2.pdf","comment":"See our project website (https://liruiw.github.io/gensim), demo and\n datasets (https://huggingface.co/spaces/Gen-Sim/Gen-Sim), and code\n (https://github.com/liruiw/GenSim) for more details"},{"id":"http://arxiv.org/abs/2401.11582v1","updated":"2024-01-21T20:10:02Z","published":"2024-01-21T20:10:02Z","title":"Thermal Image Calibration and Correction using Unpaired Cycle-Consistent\n Adversarial Networks","summary":" Unmanned aerial vehicles (UAVs) offer a flexible and cost-effective solution\nfor wildfire monitoring. However, their widespread deployment during wildfires\nhas been hindered by a lack of operational guidelines and concerns about\npotential interference with aircraft systems. Consequently, the progress in\ndeveloping deep-learning models for wildfire detection and characterization\nusing aerial images is constrained by the limited availability, size, and\nquality of existing datasets. This paper introduces a solution aimed at\nenhancing the quality of current aerial wildfire datasets to align with\nadvancements in camera technology. The proposed approach offers a solution to\ncreate a comprehensive, standardized large-scale image dataset. 
This paper\npresents a pipeline based on CycleGAN to enhance wildfire datasets and a novel\nfusion method that integrates paired RGB images as attribute conditioning in\nthe generators of both directions, improving the accuracy of the generated\nimages.\n","authors":["Hossein Rajoli","Pouya Afshin","Fatemeh Afghah"],"pdf_url":"https://arxiv.org/pdf/2401.11582v1.pdf","comment":"This paper has been accepted at the Asilomar 2023 Conference and will\n be published"},{"id":"http://arxiv.org/abs/2303.05123v3","updated":"2024-01-21T18:11:49Z","published":"2023-03-09T09:12:21Z","title":"Dominating Set Database Selection for Visual Place Recognition","summary":" This paper presents an approach for creating a visual place recognition (VPR)\ndatabase for localization in indoor environments from RGBD scanning sequences.\nThe proposed approach is formulated as a minimization problem in terms of\ndominating set algorithm for graph, constructed from spatial information, and\nreferred as DominatingSet. Our algorithm shows better scene coverage in\ncomparison to other methodologies that are used for database creation. Also, we\ndemonstrate that using DominatingSet, a database size could be up to 250-1400\ntimes smaller than the original scanning sequence while maintaining a recall\nrate of more than 80% on testing sequences. We evaluated our algorithm on\n7-scenes and BundleFusion datasets and an additionally recorded sequence in a\nhighly repetitive office setting. In addition, the database selection can\nproduce weakly-supervised labels for fine-tuning neural place recognition\nalgorithms to particular settings, improving even more their accuracy. The\npaper also presents a fully automated pipeline for VPR database creation from\nRGBD scanning sequences, as well as a set of metrics for VPR database\nevaluation. The code and released data are available on our web-page~ --\nhttps://prime-slam.github.io/place-recognition-db/\n","authors":["Anastasiia Kornilova","Ivan Moskalenko","Timofei Pushkin","Fakhriddin Tojiboev","Rahim Tariverdizadeh","Gonzalo Ferrer"],"pdf_url":"https://arxiv.org/pdf/2303.05123v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11544v1","updated":"2024-01-21T16:59:44Z","published":"2024-01-21T16:59:44Z","title":"Hierarchical Prompts for Rehearsal-free Continual Learning","summary":" Continual learning endeavors to equip the model with the capability to\nintegrate current task knowledge while mitigating the forgetting of past task\nknowledge. Inspired by prompt tuning, prompt-based methods maintain a frozen\nbackbone and train with slight learnable prompts to minimize the catastrophic\nforgetting that arises due to updating a large number of backbone parameters.\nNonetheless, these learnable prompts tend to concentrate on the discriminatory\nknowledge of the current task while ignoring past task knowledge, leading to\nthat learnable prompts still suffering from catastrophic forgetting. This paper\nintroduces a novel rehearsal-free paradigm for continual learning termed\nHierarchical Prompts (H-Prompts), comprising three categories of prompts --\nclass prompt, task prompt, and general prompt. To effectively depict the\nknowledge of past classes, class prompt leverages Bayesian Distribution\nAlignment to model the distribution of classes in each task. To reduce the\nforgetting of past task knowledge, task prompt employs Cross-task Knowledge\nExcavation to amalgamate the knowledge encapsulated in the learned class\nprompts of past tasks and current task knowledge. 
Furthermore, general prompt\nutilizes Generalized Knowledge Exploration to deduce highly generalized\nknowledge in a self-supervised manner. Evaluations on two benchmarks\nsubstantiate the efficacy of the proposed H-Prompts, exemplified by an average\naccuracy of 87.8% in Split CIFAR-100 and 70.6% in Split ImageNet-R.\n","authors":["Yukun Zuo","Hantao Yao","Lu Yu","Liansheng Zhuang","Changsheng Xu"],"pdf_url":"https://arxiv.org/pdf/2401.11544v1.pdf","comment":"Submitted to TPAMI"},{"id":"http://arxiv.org/abs/2401.11543v1","updated":"2024-01-21T16:55:40Z","published":"2024-01-21T16:55:40Z","title":"How Robust Are Energy-Based Models Trained With Equilibrium Propagation?","summary":" Deep neural networks (DNNs) are easily fooled by adversarial perturbations\nthat are imperceptible to humans. Adversarial training, a process where\nadversarial examples are added to the training set, is the current\nstate-of-the-art defense against adversarial attacks, but it lowers the model's\naccuracy on clean inputs, is computationally expensive, and offers less\nrobustness to natural noise. In contrast, energy-based models (EBMs), which\nwere designed for efficient implementation in neuromorphic hardware and\nphysical systems, incorporate feedback connections from each layer to the\nprevious layer, yielding a recurrent, deep-attractor architecture which we\nhypothesize should make them naturally robust. Our work is the first to explore\nthe robustness of EBMs to both natural corruptions and adversarial attacks,\nwhich we do using the CIFAR-10 and CIFAR-100 datasets. We demonstrate that EBMs\nare more robust than transformers and display comparable robustness to\nadversarially-trained DNNs on gradient-based (white-box) attacks, query-based\n(black-box) attacks, and natural perturbations without sacrificing clean\naccuracy, and without the need for adversarial training or additional training\ntechniques.\n","authors":["Siddharth Mansingh","Michal Kucer","Garrett Kenyon","Juston Moore","Michael Teti"],"pdf_url":"https://arxiv.org/pdf/2401.11543v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11541v1","updated":"2024-01-21T16:46:04Z","published":"2024-01-21T16:46:04Z","title":"Multi-View Neural 3D Reconstruction of Micro-/Nanostructures with Atomic\n Force Microscopy","summary":" Atomic Force Microscopy (AFM) is a widely employed tool for micro-/nanoscale\ntopographic imaging. However, conventional AFM scanning struggles to\nreconstruct complex 3D micro-/nanostructures precisely due to limitations such\nas incomplete sample topography capturing and tip-sample convolution artifacts.\nHere, we propose a multi-view neural-network-based framework with AFM\n(MVN-AFM), which accurately reconstructs surface models of intricate\nmicro-/nanostructures. Unlike previous works, MVN-AFM does not depend on any\nspecially shaped probes or costly modifications to the AFM system. To achieve\nthis, MVN-AFM uniquely employs an iterative method to align multi-view data and\neliminate AFM artifacts simultaneously. Furthermore, we pioneer the application\nof neural implicit surface reconstruction in nanotechnology and achieve\nmarkedly improved results. Extensive experiments show that MVN-AFM effectively\neliminates artifacts present in raw AFM images and reconstructs various\nmicro-/nanostructures including complex geometrical microstructures printed via\nTwo-photon Lithography and nanoparticles such as PMMA nanospheres and ZIF-67\nnanocrystals. 
This work presents a cost-effective tool for micro-/nanoscale 3D\nanalysis.\n","authors":["Shuo Chen","Mao Peng","Yijin Li","Bing-Feng Ju","Hujun Bao","Yuan-Liu Chen","Guofeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.11541v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.16301v2","updated":"2024-01-21T16:27:06Z","published":"2023-09-28T09:54:10Z","title":"Gated Cross-Attention Network for Depth Completion","summary":" Depth completion is a popular research direction in the field of depth\nestimation. The fusion of color and depth features is the current critical\nchallenge in this task, mainly due to the asymmetry between the rich scene\ndetails in color images and the sparse pixels in depth maps. To tackle this\nissue, we design an efficient Gated Cross-Attention Network that propagates\nconfidence via a gating mechanism, simultaneously extracting and refining key\ninformation in both color and depth branches to achieve local spatial feature\nfusion. Additionally, we employ an attention network based on the Transformer\nin low-dimensional space to effectively fuse global features and increase the\nnetwork's receptive field. With a simple yet efficient gating mechanism, our\nproposed method achieves fast and accurate depth completion without the need\nfor additional branches or post-processing steps. At the same time, we use the\nRay Tune mechanism with the AsyncHyperBandScheduler scheduler and the\nHyperOptSearch algorithm to automatically search for the optimal number of\nmodule iterations, which also allows us to achieve performance comparable to\nstate-of-the-art methods. We conduct experiments on both indoor and outdoor\nscene datasets. Our fast network achieves Pareto-optimal solutions in terms of\ntime and accuracy, and at the time of submission, our accurate network ranks\nfirst among all published papers on the KITTI official website in terms of\naccuracy.\n","authors":["Xiaogang Jia","Songlei Jian","Yusong Tan","Yonggang Che","Wei Chen","Zhengfa Liang"],"pdf_url":"https://arxiv.org/pdf/2309.16301v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.13472v3","updated":"2024-01-21T16:14:44Z","published":"2023-03-23T17:43:17Z","title":"Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion\n Models","summary":" Neural video game simulators emerged as powerful tools to generate and edit\nvideos. Their idea is to represent games as the evolution of an environment's\nstate driven by the actions of its agents. While such a paradigm enables users\nto play a game action-by-action, its rigidity precludes more semantic forms of\ncontrol. To overcome this limitation, we augment game models with prompts\nspecified as a set of natural language actions and desired states. The result-a\nPromptable Game Model (PGM)-makes it possible for a user to play the game by\nprompting it with high- and low-level action sequences. Most captivatingly, our\nPGM unlocks the director's mode, where the game is played by specifying goals\nfor the agents in the form of a prompt. This requires learning \"game AI\",\nencapsulated by our animation model, to navigate the scene using high-level\nconstraints, play against an adversary, and devise a strategy to win a point.\nTo render the resulting state, we use a compositional NeRF representation\nencapsulated in our synthesis model. To foster future research, we present\nnewly collected, annotated and calibrated Tennis and Minecraft datasets. 
Our\nmethod significantly outperforms existing neural video game simulators in terms\nof rendering quality and unlocks applications beyond the capabilities of the\ncurrent state of the art. Our framework, data, and models are available at\nhttps://snap-research.github.io/promptable-game-models/.\n","authors":["Willi Menapace","Aliaksandr Siarohin","Stéphane Lathuilière","Panos Achlioptas","Vladislav Golyanik","Sergey Tulyakov","Elisa Ricci"],"pdf_url":"https://arxiv.org/pdf/2303.13472v3.pdf","comment":"ACM Transactions on Graphics \\c{opyright} Copyright is held by the\n owner/author(s) 2023. This is the author's version of the work. It is posted\n here for your personal use. Not for redistribution. The definitive Version of\n Record was published in ACM Transactions on Graphics,\n http://dx.doi.org/10.1145/3635705"},{"id":"http://arxiv.org/abs/2401.11535v1","updated":"2024-01-21T16:14:04Z","published":"2024-01-21T16:14:04Z","title":"Deformable Endoscopic Tissues Reconstruction with Gaussian Splatting","summary":" Surgical 3D reconstruction is a critical area of research in robotic surgery,\nwith recent works adopting variants of dynamic radiance fields to achieve\nsuccess in 3D reconstruction of deformable tissues from single-viewpoint\nvideos. However, these methods often suffer from time-consuming optimization or\ninferior quality, limiting their adoption in downstream tasks. Inspired by 3D\nGaussian Splatting, a recent trending 3D representation, we present EndoGS,\napplying Gaussian Splatting for deformable endoscopic tissue reconstruction.\nSpecifically, our approach incorporates deformation fields to handle dynamic\nscenes, depth-guided supervision to optimize 3D targets with a single\nviewpoint, and a spatial-temporal weight mask to mitigate tool occlusion. As a\nresult, EndoGS reconstructs and renders high-quality deformable endoscopic\ntissues from a single-viewpoint video, estimated depth maps, and labeled tool\nmasks. Experiments on DaVinci robotic surgery videos demonstrate that EndoGS\nachieves superior rendering quality. Code is available at\nhttps://github.com/HKU-MedAI/EndoGS.\n","authors":["Lingting Zhu","Zhao Wang","Zhenchao Jin","Guying Lin","Lequan Yu"],"pdf_url":"https://arxiv.org/pdf/2401.11535v1.pdf","comment":"Work in progress. 10 pages, 4 figures"},{"id":"http://arxiv.org/abs/2401.11519v1","updated":"2024-01-21T15:22:15Z","published":"2024-01-21T15:22:15Z","title":"CaBuAr: California Burned Areas dataset for delineation","summary":" Forest wildfires represent one of the catastrophic events that, over the last\ndecades, caused huge environmental and humanitarian damages. In addition to a\nsignificant amount of carbon dioxide emission, they are a source of risk to\nsociety in both short-term (e.g., temporary city evacuation due to fire) and\nlong-term (e.g., higher risks of landslides) cases. Consequently, the\navailability of tools to support local authorities in automatically identifying\nburned areas plays an important role in the continuous monitoring requirement\nto alleviate the aftereffects of such catastrophic events. The great\navailability of satellite acquisitions coupled with computer vision techniques\nrepresents an important step in developing such tools. This paper introduces a\nnovel open dataset that tackles the burned area delineation problem, a binary\nsegmentation problem applied to satellite imagery. The presented resource\nconsists of pre- and post-fire Sentinel-2 L2A acquisitions of California forest\nfires that took place starting in 2015. 
Raster annotations were generated from\nthe data released by California's Department of Forestry and Fire Protection.\nMoreover, in conjunction with the dataset, we release three different baselines\nbased on spectral indexes analyses, SegFormer, and U-Net models.\n","authors":["Daniele Rege Cambrin","Luca Colomba","Paolo Garza"],"pdf_url":"https://arxiv.org/pdf/2401.11519v1.pdf","comment":"Accepted at the IEEE Geoscience and Remote Sensing Magazine"},{"id":"http://arxiv.org/abs/2401.11511v1","updated":"2024-01-21T14:48:38Z","published":"2024-01-21T14:48:38Z","title":"MobileARLoc: On-device Robust Absolute Localisation for Pervasive\n Markerless Mobile AR","summary":" Recent years have seen significant improvement in absolute camera pose\nestimation, paving the way for pervasive markerless Augmented Reality (AR).\nHowever, accurate absolute pose estimation techniques are computation- and\nstorage-heavy, requiring computation offloading. As such, AR systems rely on\nvisual-inertial odometry (VIO) to track the device's relative pose between\nrequests to the server. However, VIO suffers from drift, requiring frequent\nabsolute repositioning. This paper introduces MobileARLoc, a new framework for\non-device large-scale markerless mobile AR that combines an absolute pose\nregressor (APR) with a local VIO tracking system. Absolute pose regressors\n(APRs) provide fast on-device pose estimation at the cost of reduced accuracy.\nTo address APR accuracy and reduce VIO drift, MobileARLoc creates a feedback\nloop where VIO pose estimations refine the APR predictions. The VIO system\nidentifies reliable predictions of APR, which are then used to compensate for\nthe VIO drift. We comprehensively evaluate MobileARLoc through dataset\nsimulations. MobileARLoc halves the error compared to the underlying APR and\nachieve fast (80\\,ms) on-device inference speed.\n","authors":["Changkun Liu","Yukun Zhao","Tristan Braud"],"pdf_url":"https://arxiv.org/pdf/2401.11511v1.pdf","comment":"Accepted for publication at the 3rd edition of the Pervasive and\n Resource-Constrained AI (PerConAI) workshop (co-located with PerCom 2024).\n arXiv admin note: substantial text overlap with arXiv:2308.05394"},{"id":"http://arxiv.org/abs/2401.11499v1","updated":"2024-01-21T14:09:49Z","published":"2024-01-21T14:09:49Z","title":"Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality\n Signals","summary":" Learning the dense bird's eye view (BEV) motion flow in a self-supervised\nmanner is an emerging research for robotics and autonomous driving. Current\nself-supervised methods mainly rely on point correspondences between point\nclouds, which may introduce the problems of fake flow and inconsistency,\nhindering the model's ability to learn accurate and realistic motion. In this\npaper, we introduce a novel cross-modality self-supervised training framework\nthat effectively addresses these issues by leveraging multi-modality data to\nobtain supervision signals. 
We design three innovative supervision signals to\npreserve the inherent properties of scene motion, including the masked Chamfer\ndistance loss, the piecewise rigidity loss, and the temporal consistency loss.\nThrough extensive experiments, we demonstrate that our proposed self-supervised\nframework outperforms all previous self-supervision methods for the motion\nprediction task.\n","authors":["Shaoheng Fang","Zuhong Liu","Mingyu Wang","Chenxin Xu","Yiqi Zhong","Siheng Chen"],"pdf_url":"https://arxiv.org/pdf/2401.11499v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11492v1","updated":"2024-01-21T13:45:52Z","published":"2024-01-21T13:45:52Z","title":"Edge-Enabled Real-time Railway Track Segmentation","summary":" Accurate and rapid railway track segmentation can assist automatic train\ndriving and is a key step in early warning to fixed or moving obstacles on the\nrailway track. However, certain existing algorithms tailored for track\nsegmentation often struggle to meet the requirements of real-time and\nefficiency on resource-constrained edge devices. Considering this challenge, we\npropose an edge-enabled real-time railway track segmentation algorithm, which\nis optimized to be suitable for edge applications by optimizing the network\nstructure and quantizing the model after training. Initially, Ghost convolution\nis introduced to reduce the complexity of the backbone, thereby achieving the\nextraction of key information of the interested region at a lower cost. To\nfurther reduce the model complexity and calculation, a new lightweight\ndetection head is proposed to achieve the best balance between accuracy and\nefficiency. Subsequently, we introduce quantization techniques to map the\nmodel's floating-point weights and activation values into lower bit-width\nfixed-point representations, reducing computational demands and memory\nfootprint, ultimately accelerating the model's inference. Finally, we draw\ninspiration from GPU parallel programming principles to expedite the\npre-processing and post-processing stages of the algorithm by doing parallel\nprocessing. The approach is evaluated with public and challenging dataset\nRailSem19 and tested on Jetson Nano. Experimental results demonstrate that our\nenhanced algorithm achieves an accuracy level of 83.3% while achieving a\nreal-time inference rate of 25 frames per second when the input size is\n480x480, thereby effectively meeting the requirements for real-time and\nhigh-efficiency operation.\n","authors":["Chen Chenglin","Wang Fei","Yang Min","Qin Yong","Bai Yun"],"pdf_url":"https://arxiv.org/pdf/2401.11492v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05173v4","updated":"2024-01-21T13:38:20Z","published":"2023-09-11T00:02:05Z","title":"DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning","summary":" Prompt tuning (PT), where a small amount of trainable soft (continuous)\nprompt vectors is affixed to the input of language models (LM), has shown\npromising results across various tasks and models for parameter-efficient\nfine-tuning (PEFT). PT stands out from other PEFT approaches because it\nmaintains competitive performance with fewer trainable parameters and does not\ndrastically scale up its parameters as the model size expands. However, PT\nintroduces additional soft prompt tokens, leading to longer input sequences,\nwhich significantly impacts training and inference time and memory usage due to\nthe Transformer's quadratic complexity. 
Particularly concerning for Large\nLanguage Models (LLMs) that face heavy daily querying. To address this issue,\nwe propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt\ninto a shorter soft prompt and a pair of low-rank matrices that are then\noptimised with two different learning rates. This allows DePT to achieve better\nperformance while saving substantial memory and time costs compared to vanilla\nPT and its variants, without changing trainable parameter sizes. Through\nextensive experiments on 23 natural language processing (NLP) and\nvision-language (VL) tasks, we demonstrate that DePT outperforms\nstate-of-the-art PEFT approaches, including the full fine-tuning baseline, in\nsome scenarios. Additionally, we empirically show that DEPT grows more\nefficient as the model size increases. Our further study reveals that DePT\nintegrates seamlessly with parameter-efficient transfer learning in the\nfew-shot learning setting and highlights its adaptability to various model\narchitectures and sizes.\n","authors":["Zhengxiang Shi","Aldo Lipani"],"pdf_url":"https://arxiv.org/pdf/2309.05173v4.pdf","comment":"ICLR 2024. Code is available at https://github.com/ZhengxiangShi/DePT"},{"id":"http://arxiv.org/abs/2303.11681v4","updated":"2024-01-21T13:35:44Z","published":"2023-03-21T08:43:15Z","title":"DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic\n Segmentation Using Diffusion Models","summary":" Collecting and annotating images with pixel-wise labels is time-consuming and\nlaborious. In contrast, synthetic data can be freely available using a\ngenerative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that\nit is possible to automatically obtain accurate semantic masks of synthetic\nimages generated by the Off-the-shelf Stable Diffusion model, which uses only\ntext-image pairs during training. Our approach, called DiffuMask, exploits the\npotential of the cross-attention map between text and image, which is natural\nand seamless to extend the text-driven image synthesis to semantic mask\ngeneration. DiffuMask uses text-guided cross-attention information to localize\nclass/word-specific regions, which are combined with practical techniques to\ncreate a novel high-resolution and class-discriminative pixel-wise mask. The\nmethods help to reduce data collection and annotation costs obviously.\nExperiments demonstrate that the existing segmentation methods trained on\nsynthetic data of DiffuMask can achieve a competitive performance over the\ncounterpart of real data (VOC 2012, Cityscapes). For some classes (e.g., bird),\nDiffuMask presents promising performance, close to the stateof-the-art result\nof real data (within 3% mIoU gap). Moreover, in the open-vocabulary\nsegmentation (zero-shot) setting, DiffuMask achieves a new SOTA result on\nUnseen class of VOC 2012. The project website can be found at\nhttps://weijiawu.github.io/DiffusionMask/.\n","authors":["Weijia Wu","Yuzhong Zhao","Mike Zheng Shou","Hong Zhou","Chunhua Shen"],"pdf_url":"https://arxiv.org/pdf/2303.11681v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11489v1","updated":"2024-01-21T13:30:02Z","published":"2024-01-21T13:30:02Z","title":"MapChange: Enhancing Semantic Change Detection with Temporal-Invariant\n Historical Maps Based on Deep Triplet Network","summary":" Semantic Change Detection (SCD) is recognized as both a crucial and\nchallenging task in the field of image analysis. 
Traditional methods for SCD\nhave predominantly relied on the comparison of image pairs. However, this\napproach is significantly hindered by substantial imaging differences, which\narise due to variations in shooting times, atmospheric conditions, and angles.\nSuch discrepancies lead to two primary issues: the under-detection of minor yet\nsignificant changes, and the generation of false alarms due to temporal\nvariances. These factors often result in unchanged objects appearing markedly\ndifferent in multi-temporal images. In response to these challenges, the\nMapChange framework has been developed. This framework introduces a novel\nparadigm that synergizes temporal-invariant historical map data with\ncontemporary high-resolution images. By employing this combination, the\ntemporal variance inherent in conventional image pair comparisons is\neffectively mitigated. The efficacy of the MapChange framework has been\nempirically validated through comprehensive testing on two public datasets.\nThese tests have demonstrated the framework's marked superiority over existing\nstate-of-the-art SCD methods.\n","authors":["Yinhe Liu","Sunan Shi","Zhuo Zheng","Jue Wang","Shiqi Tian","Yanfei Zhong"],"pdf_url":"https://arxiv.org/pdf/2401.11489v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.01738v4","updated":"2024-01-21T13:27:31Z","published":"2023-08-03T12:58:23Z","title":"Enhancing Visibility in Nighttime Haze Images Using Guided APSF and\n Gradient Adaptive Convolution","summary":" Visibility in hazy nighttime scenes is frequently reduced by multiple\nfactors, including low light, intense glow, light scattering, and the presence\nof multicolored light sources. Existing nighttime dehazing methods often\nstruggle with handling glow or low-light conditions, resulting in either\nexcessively dark visuals or unsuppressed glow outputs. In this paper, we\nenhance the visibility from a single nighttime haze image by suppressing glow\nand enhancing low-light regions. To handle glow effects, our framework learns\nfrom the rendered glow pairs. Specifically, a light source aware network is\nproposed to detect light sources of night images, followed by the APSF\n(Atmospheric Point Spread Function)-guided glow rendering. Our framework is\nthen trained on the rendered images, resulting in glow suppression. Moreover,\nwe utilize gradient-adaptive convolution, to capture edges and textures in hazy\nscenes. By leveraging extracted edges and textures, we enhance the contrast of\nthe scene without losing important structural details. To boost low-light\nintensity, our network learns an attention map, then adjusted by gamma\ncorrection. This attention has high values on low-light regions and low values\non haze and glow regions. Extensive evaluation on real nighttime haze images,\ndemonstrates the effectiveness of our method. Our experiments demonstrate that\nour method achieves a PSNR of 30.38dB, outperforming state-of-the-art methods\nby 13% on GTA5 nighttime haze dataset. Our data and code is available at\nhttps://github.com/jinyeying/nighttime_dehaze.\n","authors":["Yeying Jin","Beibei Lin","Wending Yan","Yuan Yuan","Wei Ye","Robby T. 
Tan"],"pdf_url":"https://arxiv.org/pdf/2308.01738v4.pdf","comment":"Accepted to ACM'MM2023, https://github.com/jinyeying/nighttime_dehaze"},{"id":"http://arxiv.org/abs/2308.10610v2","updated":"2024-01-21T13:23:10Z","published":"2023-08-21T10:20:46Z","title":"Ultrafast and Ultralight Network-Based Intelligent System for Real-time\n Diagnosis of Ear diseases in Any Devices","summary":" Traditional ear disease diagnosis heavily depends on experienced specialists\nand specialized equipment, frequently resulting in misdiagnoses, treatment\ndelays, and financial burdens for some patients. Utilizing deep learning models\nfor efficient ear disease diagnosis has proven effective and affordable.\nHowever, existing research overlooked model inference speed and parameter size\nrequired for deployment. To tackle these challenges, we constructed a\nlarge-scale dataset comprising eight ear disease categories and normal ear\ncanal samples from two hospitals. Inspired by ShuffleNetV2, we developed\nBest-EarNet, an ultrafast and ultralight network enabling real-time ear disease\ndiagnosis. Best-EarNet incorporates the novel Local-Global Spatial Feature\nFusion Module which can capture global and local spatial information\nsimultaneously and guide the network to focus on crucial regions within feature\nmaps at various levels, mitigating low accuracy issues. Moreover, our network\nuses multiple auxiliary classification heads for efficient parameter\noptimization. With 0.77M parameters, Best-EarNet achieves an average frames per\nsecond of 80 on CPU. Employing transfer learning and five-fold cross-validation\nwith 22,581 images from Hospital-1, the model achieves an impressive 95.23%\naccuracy. External testing on 1,652 images from Hospital-2 validates its\nperformance, yielding 92.14% accuracy. Compared to state-of-the-art networks,\nBest-EarNet establishes a new state-of-the-art (SOTA) in practical\napplications. Most importantly, we developed an intelligent diagnosis system\ncalled Ear Keeper, which can be deployed on common electronic devices. By\nmanipulating a compact electronic otoscope, users can perform comprehensive\nscanning and diagnosis of the ear canal using real-time video. This study\nprovides a novel paradigm for ear endoscopy and other medical endoscopic image\nrecognition applications.\n","authors":["Yubiao Yue","Xinyu Zeng","Xiaoqiang Shi","Meiping Zhang","Haihua Liang","Fan Zhang","Yanmei Chen","Zefeng Xie","Wenrui Wu","Zhenzhang Li"],"pdf_url":"https://arxiv.org/pdf/2308.10610v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11485v1","updated":"2024-01-21T13:16:33Z","published":"2024-01-21T13:16:33Z","title":"ColorVideoVDP: A visual difference predictor for image, video and\n display distortions","summary":" ColorVideoVDP is a video and image quality metric that models spatial and\ntemporal aspects of vision, for both luminance and color. The metric is built\non novel psychophysical models of chromatic spatiotemporal contrast sensitivity\nand cross-channel contrast masking. It accounts for the viewing conditions,\ngeometric, and photometric characteristics of the display. It was trained to\npredict common video streaming distortions (e.g. video compression, rescaling,\nand transmission errors), and also 8 new distortion types related to AR/VR\ndisplays (e.g. light source and waveguide non-uniformities). To address the\nlatter application, we collected our novel XR-Display-Artifact-Video quality\ndataset (XR-DAVID), comprised of 336 distorted videos. 
Extensive testing on\nXR-DAVID, as well as several datasets from the literature, indicate a\nsignificant gain in prediction performance compared to existing metrics.\nColorVideoVDP opens the doors to many novel applications which require the\njoint automated spatiotemporal assessment of luminance and color distortions,\nincluding video streaming, display specification and design, visual comparison\nof results, and perceptually-guided quality optimization.\n","authors":["Rafal K. Mantiuk","Param Hanji","Maliha Ashraf","Yuta Asano","Alexandre Chapiro"],"pdf_url":"https://arxiv.org/pdf/2401.11485v1.pdf","comment":"28 pages"},{"id":"http://arxiv.org/abs/2401.04614v2","updated":"2024-01-21T12:56:32Z","published":"2024-01-09T15:36:07Z","title":"Generic Knowledge Boosted Pre-training For Remote Sensing Images","summary":" Deep learning models are essential for scene classification, change\ndetection, land cover segmentation, and other remote sensing image\nunderstanding tasks. Most backbones of existing remote sensing deep learning\nmodels are typically initialized by pre-trained weights obtained from ImageNet\npre-training (IMP). However, domain gaps exist between remote sensing images\nand natural images (e.g., ImageNet), making deep learning models initialized by\npre-trained weights of IMP perform poorly for remote sensing image\nunderstanding. Although some pre-training methods are studied in the remote\nsensing community, current remote sensing pre-training methods face the problem\nof vague generalization by only using remote sensing images. In this paper, we\npropose a novel remote sensing pre-training framework, Generic Knowledge\nBoosted Remote Sensing Pre-training (GeRSP), to learn robust representations\nfrom remote sensing and natural images for remote sensing understanding tasks.\nGeRSP contains two pre-training branches: (1) A self-supervised pre-training\nbranch is adopted to learn domain-related representations from unlabeled remote\nsensing images. (2) A supervised pre-training branch is integrated into GeRSP\nfor general knowledge learning from labeled natural images. Moreover, GeRSP\ncombines two pre-training branches using a teacher-student architecture to\nsimultaneously learn representations with general and special knowledge, which\ngenerates a powerful pre-trained model for deep learning model initialization.\nFinally, we evaluate GeRSP and other remote sensing pre-training methods on\nthree downstream tasks, i.e., object detection, semantic segmentation, and\nscene classification. The extensive experimental results consistently\ndemonstrate that GeRSP can effectively learn robust representations in a\nunified manner, improving the performance of remote sensing downstream tasks.\n","authors":["Ziyue Huang","Mingming Zhang","Yuan Gong","Qingjie Liu","Yunhong Wang"],"pdf_url":"https://arxiv.org/pdf/2401.04614v2.pdf","comment":"14 pages, 6 figures"},{"id":"http://arxiv.org/abs/2312.01632v3","updated":"2024-01-21T12:50:08Z","published":"2023-12-04T05:24:45Z","title":"GaussianHead: High-fidelity Head Avatars with Learnable Gaussian\n Derivation","summary":" Constructing vivid 3D head avatars for given subjects and realizing a series\nof animations on them is valuable yet challenging. This paper presents\nGaussianHead, which models the actional human head with anisotropic 3D\nGaussians. In our framework, a motion deformation field and multi-resolution\ntri-plane are constructed respectively to deal with the head's dynamic geometry\nand complex texture. 
Notably, we impose an exclusive derivation scheme on each\nGaussian, which generates its multiple doppelgangers through a set of learnable\nparameters for position transformation. With this design, we can compactly and\naccurately encode the appearance information of Gaussians, even those fitting\nthe head's particular components with sophisticated structures. In addition, an\ninherited derivation strategy for newly added Gaussians is adopted to\nfacilitate training acceleration. Extensive experiments show that our method\ncan produce high-fidelity renderings, outperforming state-of-the-art approaches\nin reconstruction, cross-identity reenactment, and novel view synthesis tasks.\nOur code is available at: https://github.com/chiehwangs/gaussian-head.\n","authors":["Jie Wang","Jiu-Cheng Xie","Xianyan Li","Feng Xu","Chi-Man Pun","Hao Gao"],"pdf_url":"https://arxiv.org/pdf/2312.01632v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.04089v2","updated":"2024-01-21T12:32:04Z","published":"2023-09-08T02:58:17Z","title":"Toward Sufficient Spatial-Frequency Interaction for Gradient-aware\n Underwater Image Enhancement","summary":" Underwater images suffer from complex and diverse degradation, which\ninevitably affects the performance of underwater visual tasks. However, most\nexisting learning-based Underwater image enhancement (UIE) methods mainly\nrestore such degradations in the spatial domain, and rarely pay attention to\nthe fourier frequency information. In this paper, we develop a novel UIE\nframework based on spatial-frequency interaction and gradient maps, namely\nSFGNet, which consists of two stages. Specifically, in the first stage, we\npropose a dense spatial-frequency fusion network (DSFFNet), mainly including\nour designed dense fourier fusion block and dense spatial fusion block,\nachieving sufficient spatial-frequency interaction by cross connections between\nthese two blocks. In the second stage, we propose a gradient-aware corrector\n(GAC) to further enhance perceptual details and geometric structures of images\nby gradient map. Experimental results on two real-world underwater image\ndatasets show that our approach can successfully enhance underwater images, and\nachieves competitive performance in visual quality improvement. The code is\navailable at https://github.com/zhihefang/SFGNet.\n","authors":["Chen Zhao","Weiling Cai","Chenyu Dong","Ziqi Zeng"],"pdf_url":"https://arxiv.org/pdf/2309.04089v2.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.11470v1","updated":"2024-01-21T11:55:42Z","published":"2024-01-21T11:55:42Z","title":"Exploring Missing Modality in Multimodal Egocentric Datasets","summary":" Multimodal video understanding is crucial for analyzing egocentric videos,\nwhere integrating multiple sensory signals significantly enhances action\nrecognition and moment localization. However, practical applications often\ngrapple with incomplete modalities due to factors like privacy concerns,\nefficiency demands, or hardware malfunctions. Addressing this, our study delves\ninto the impact of missing modalities on egocentric action recognition,\nparticularly within transformer-based models. We introduce a novel concept\n-Missing Modality Token (MMT)-to maintain performance even when modalities are\nabsent, a strategy that proves effective in the Ego4D, Epic-Kitchens, and\nEpic-Sounds datasets. 
Our method mitigates the performance loss, reducing it\nfrom its original $\\sim 30\\%$ drop to only $\\sim 10\\%$ when half of the test\nset is modal-incomplete. Through extensive experimentation, we demonstrate the\nadaptability of MMT to different training scenarios and its superiority in\nhandling missing modalities compared to current methods. Our research\ncontributes a comprehensive analysis and an innovative approach, opening\navenues for more resilient multimodal systems in real-world settings.\n","authors":["Merey Ramazanova","Alejandro Pardo","Humam Alwassel","Bernard Ghanem"],"pdf_url":"https://arxiv.org/pdf/2401.11470v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.14835v2","updated":"2024-01-21T11:13:31Z","published":"2022-11-27T14:18:40Z","title":"CLID: Controlled-Length Image Descriptions with Limited Data","summary":" Controllable image captioning models generate human-like image descriptions,\nenabling some kind of control over the generated captions. This paper focuses\non controlling the caption length, i.e. a short and concise description or a\nlong and detailed one. Since existing image captioning datasets contain mostly\nshort captions, generating long captions is challenging. To address the\nshortage of long training examples, we propose to enrich the dataset with\nvarying-length self-generated captions. These, however, might be of varying\nquality and are thus unsuitable for conventional training. We introduce a novel\ntraining strategy that selects the data points to be used at different times\nduring the training. Our method dramatically improves the length-control\nabilities, while exhibiting SoTA performance in terms of caption quality. Our\napproach is general and is shown to be applicable also to paragraph generation.\n","authors":["Elad Hirsch","Ayellet Tal"],"pdf_url":"https://arxiv.org/pdf/2211.14835v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11464v1","updated":"2024-01-21T11:12:00Z","published":"2024-01-21T11:12:00Z","title":"Task-specific regularization loss towards model calibration for reliable\n lung cancer detection","summary":" Lung cancer is one of the significant causes of cancer-related deaths\nglobally. Early detection and treatment improve the chances of survival.\nTraditionally CT scans have been used to extract the most significant lung\ninfection information and diagnose cancer. This process is carried out manually\nby an expert radiologist. The imbalance in the radiologists-to-population ratio\nin a country like India implies significant work pressure on them and thus\nraises the need to automate a few of their responsibilities. The tendency of\nmodern-day Deep Neural networks to make overconfident mistakes limit their\nusage to detect cancer. In this paper, we propose a new task-specific loss\nfunction to calibrate the neural network to reduce the risk of overconfident\nmistakes. We use the state-of-the-art Multi-class Difference in Confidence and\nAccuracy (MDCA) loss in conjunction with the proposed task-specific loss\nfunction to achieve the same. We also integrate post-hoc calibration by\nperforming temperature scaling on top of the train-time calibrated model. 
We\ndemonstrate 5.98% improvement in the Expected Calibration Error (ECE) and a\n17.9% improvement in Maximum Calibration Error (MCE) as compared to the\nbest-performing SOTA algorithm.\n","authors":["Mehar Prateek Kalra","Mansi Singhal","Rohan Raju Dhanakashirur"],"pdf_url":"https://arxiv.org/pdf/2401.11464v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.04852v2","updated":"2024-01-21T10:54:55Z","published":"2021-06-09T07:20:54Z","title":"Deep Tiny Network for Recognition-Oriented Face Image Quality Assessment","summary":" Face recognition has made significant progress in recent years due to deep\nconvolutional neural networks (CNN). In many face recognition (FR) scenarios,\nface images are acquired from a sequence with huge intra-variations. These\nintra-variations, which are mainly affected by the low-quality face images,\ncause instability of recognition performance. Previous works have focused on\nad-hoc methods to select frames from a video or use face image quality\nassessment (FIQA) methods, which consider only a particular or combination of\nseveral distortions.\n In this work, we present an efficient non-reference image quality assessment\nfor FR that directly links image quality assessment (IQA) and FR. More\nspecifically, we propose a new measurement to evaluate image quality without\nany reference. Based on the proposed quality measurement, we propose a deep\nTiny Face Quality network (tinyFQnet) to learn a quality prediction function\nfrom data.\n We evaluate the proposed method for different powerful FR models on two\nclassical video-based (or template-based) benchmark: IJB-B and YTF. Extensive\nexperiments show that, although the tinyFQnet is much smaller than the others,\nthe proposed method outperforms state-of-the-art quality assessment methods in\nterms of effectiveness and efficiency.\n","authors":["Baoyun Peng","Min Liu","Zhaoning Zhang","Kai Xu","Dongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2106.04852v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11453v1","updated":"2024-01-21T10:20:46Z","published":"2024-01-21T10:20:46Z","title":"Inter-Domain Mixup for Semi-Supervised Domain Adaptation","summary":" Semi-supervised domain adaptation (SSDA) aims to bridge source and target\ndomain distributions, with a small number of target labels available, achieving\nbetter classification performance than unsupervised domain adaptation (UDA).\nHowever, existing SSDA work fails to make full use of label information from\nboth source and target domains for feature alignment across domains, resulting\nin label mismatch in the label space during model testing. This paper presents\na novel SSDA approach, Inter-domain Mixup with Neighborhood Expansion (IDMNE),\nto tackle this issue. Firstly, we introduce a cross-domain feature alignment\nstrategy, Inter-domain Mixup, that incorporates label information into model\nadaptation. Specifically, we employ sample-level and manifold-level data mixing\nto generate compatible training samples. These newly established samples,\ncombined with reliable and actual label information, display diversity and\ncompatibility across domains, while such extra supervision thus facilitates\ncross-domain feature alignment and mitigates label mismatch. 
Additionally, we\nutilize Neighborhood Expansion to leverage high-confidence pseudo-labeled\nsamples in the target domain, diversifying the label information of the target\ndomain and thereby further increasing the performance of the adaptation model.\nAccordingly, the proposed approach outperforms existing state-of-the-art\nmethods, achieving significant accuracy improvements on popular SSDA\nbenchmarks, including DomainNet, Office-Home, and Office-31.\n","authors":["Jichang Li","Guanbin Li","Yizhou Yu"],"pdf_url":"https://arxiv.org/pdf/2401.11453v1.pdf","comment":"Published to Elsevier PR2024, available at\n https://www.sciencedirect.com/science/article/pii/S0031320323007203?via%3Dihub"},{"id":"http://arxiv.org/abs/2401.11448v1","updated":"2024-01-21T09:57:56Z","published":"2024-01-21T09:57:56Z","title":"Adaptive Betweenness Clustering for Semi-Supervised Domain Adaptation","summary":" Compared to unsupervised domain adaptation, semi-supervised domain adaptation\n(SSDA) aims to significantly improve the classification performance and\ngeneralization capability of the model by leveraging the presence of a small\namount of labeled data from the target domain. Several SSDA approaches have\nbeen developed to enable semantic-aligned feature confusion between labeled (or\npseudo labeled) samples across domains; nevertheless, owing to the scarcity of\nsemantic label information of the target domain, they struggled to fully\nrealize their potential. In this study, we propose a novel SSDA approach named\nGraph-based Adaptive Betweenness Clustering (G-ABC) for achieving categorical\ndomain alignment, which enables cross-domain semantic alignment by mandating\nsemantic transfer from labeled data of both the source and target domains to\nunlabeled target samples. In particular, a heterogeneous graph is initially\nconstructed to reflect the pairwise relationships between labeled samples from\nboth domains and unlabeled ones of the target domain. Then, to degrade the\nnoisy connectivity in the graph, connectivity refinement is conducted by\nintroducing two strategies, namely Confidence Uncertainty based Node Removal\nand Prediction Dissimilarity based Edge Pruning. Once the graph has been\nrefined, Adaptive Betweenness Clustering is introduced to facilitate semantic\ntransfer by using across-domain betweenness clustering and within-domain\nbetweenness clustering, thereby propagating semantic label information from\nlabeled samples across domains to unlabeled target data. Extensive experiments\non three standard benchmark datasets, namely DomainNet, Office-Home, and\nOffice-31, indicated that our method outperforms previous state-of-the-art SSDA\napproaches, demonstrating the superiority of the proposed G-ABC algorithm.\n","authors":["Jichang Li","Guanbin Li","Yizhou Yu"],"pdf_url":"https://arxiv.org/pdf/2401.11448v1.pdf","comment":"16 pages, 9 figures, published to IEEE TIP"},{"id":"http://arxiv.org/abs/2401.11439v1","updated":"2024-01-21T09:39:11Z","published":"2024-01-21T09:39:11Z","title":"General Flow as Foundation Affordance for Scalable Robot Learning","summary":" We address the challenge of acquiring real-world manipulation skills with a\nscalable framework. Inspired by the success of large-scale auto-regressive\nprediction in Large Language Models (LLMs), we hold the belief that identifying\nan appropriate prediction target capable of leveraging large-scale datasets is\ncrucial for achieving efficient and universal learning. 
Therefore, we propose\nto utilize flow, which represents the future trajectories of 3D points on\nobjects of interest, as an ideal prediction target in robot learning. To\nexploit scalable data resources, we turn our attention to cross-embodiment\ndatasets. We develop, for the first time, a language-conditioned prediction\nmodel directly from large-scale RGBD human video datasets. Our predicted flow\noffers actionable geometric and physics guidance, thus facilitating stable\nzero-shot skill transfer in real-world scenarios.We deploy our method with a\npolicy based on closed-loop flow prediction. Remarkably, without any additional\ntraining, our method achieves an impressive 81% success rate in human-to-robot\nskill transfer, covering 18 tasks in 6 scenes. Our framework features the\nfollowing benefits: (1) scalability: leveraging cross-embodiment data\nresources; (2) universality: multiple object categories, including rigid,\narticulated, and soft bodies; (3) stable skill transfer: providing actionable\nguidance with a small inference domain-gap. These lead to a new pathway towards\nscalable general robot learning. Data, code, and model weights will be made\npublicly available.\n","authors":["Chengbo Yuan","Chuan Wen","Tong Zhang","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2401.11439v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11436v1","updated":"2024-01-21T09:16:29Z","published":"2024-01-21T09:16:29Z","title":"Geometric Prior Guided Feature Representation Learning for Long-Tailed\n Classification","summary":" Real-world data are long-tailed, the lack of tail samples leads to a\nsignificant limitation in the generalization ability of the model. Although\nnumerous approaches of class re-balancing perform well for moderate class\nimbalance problems, additional knowledge needs to be introduced to help the\ntail class recover the underlying true distribution when the observed\ndistribution from a few tail samples does not represent its true distribution\nproperly, thus allowing the model to learn valuable information outside the\nobserved domain. In this work, we propose to leverage the geometric information\nof the feature distribution of the well-represented head class to guide the\nmodel to learn the underlying distribution of the tail class. Specifically, we\nfirst systematically define the geometry of the feature distribution and the\nsimilarity measures between the geometries, and discover four phenomena\nregarding the relationship between the geometries of different feature\ndistributions. Then, based on four phenomena, feature uncertainty\nrepresentation is proposed to perturb the tail features by utilizing the\ngeometry of the head class feature distribution. It aims to make the perturbed\nfeatures cover the underlying distribution of the tail class as much as\npossible, thus improving the model's generalization performance in the test\ndomain. Finally, we design a three-stage training scheme enabling feature\nuncertainty modeling to be successfully applied. Experiments on\nCIFAR-10/100-LT, ImageNet-LT, and iNaturalist2018 show that our proposed\napproach outperforms other similar methods on most metrics. 
In addition, the\nexperimental phenomena we discovered are able to provide new perspectives and\ntheoretical foundations for subsequent studies.\n","authors":["Yanbiao Ma","Licheng Jiao","Fang Liu","Shuyuan Yang","Xu Liu","Puhua Chen"],"pdf_url":"https://arxiv.org/pdf/2401.11436v1.pdf","comment":"This work was accepted by the IJCV"},{"id":"http://arxiv.org/abs/2401.09496v2","updated":"2024-01-21T08:51:37Z","published":"2024-01-17T01:37:17Z","title":"Learning to Generalize over Subpartitions for Heterogeneity-aware Domain\n Adaptive Nuclei Segmentation","summary":" Annotation scarcity and cross-modality/stain data distribution shifts are two\nmajor obstacles hindering the application of deep learning models for nuclei\nanalysis, which holds a broad spectrum of potential applications in digital\npathology. Recently, unsupervised domain adaptation (UDA) methods have been\nproposed to mitigate the distributional gap between different imaging\nmodalities for unsupervised nuclei segmentation in histopathology images.\nHowever, existing UDA methods are built upon the assumption that data\ndistributions within each domain should be uniform. Based on the\nover-simplified supposition, they propose to align the histopathology target\ndomain with the source domain integrally, neglecting severe intra-domain\ndiscrepancy over subpartitions incurred by mixed cancer types and sampling\norgans. In this paper, for the first time, we propose to explicitly consider\nthe heterogeneity within the histopathology domain and introduce open compound\ndomain adaptation (OCDA) to resolve the crux. In specific, a two-stage\ndisentanglement framework is proposed to acquire domain-invariant feature\nrepresentations at both image and instance levels. The holistic design\naddresses the limitations of existing OCDA approaches which struggle to capture\ninstance-wise variations. Two regularization strategies are specifically\ndevised herein to leverage the rich subpartition-specific characteristics in\nhistopathology images and facilitate subdomain decomposition. Moreover, we\npropose a dual-branch nucleus shape and structure preserving module to prevent\nnucleus over-generation and deformation in the synthesized images. Experimental\nresults on both cross-modality and cross-stain scenarios over a broad range of\ndiverse datasets demonstrate the superiority of our method compared with\nstate-of-the-art UDA and OCDA methods.\n","authors":["Jianan Fan","Dongnan Liu","Hang Chang","Weidong Cai"],"pdf_url":"https://arxiv.org/pdf/2401.09496v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11430v1","updated":"2024-01-21T08:35:25Z","published":"2024-01-21T08:35:25Z","title":"Exploring Diffusion Time-steps for Unsupervised Representation Learning","summary":" Representation learning is all about discovering the hidden modular\nattributes that generate the data faithfully. We explore the potential of\nDenoising Diffusion Probabilistic Model (DM) in unsupervised learning of the\nmodular attributes. We build a theoretical framework that connects the\ndiffusion time-steps and the hidden attributes, which serves as an effective\ninductive bias for unsupervised learning. 
Specifically, the forward diffusion\nprocess incrementally adds Gaussian noise to samples at each time-step, which\nessentially collapses different samples into similar ones by losing attributes,\ne.g., fine-grained attributes such as texture are lost with less noise added\n(i.e., early time-steps), while coarse-grained ones such as shape are lost by\nadding more noise (i.e., late time-steps). To disentangle the modular\nattributes, at each time-step t, we learn a t-specific feature to compensate\nfor the newly lost attribute, and the set of all 1,...,t-specific features,\ncorresponding to the cumulative set of lost attributes, are trained to make up\nfor the reconstruction error of a pre-trained DM at time-step t. On CelebA,\nFFHQ, and Bedroom datasets, the learned feature significantly improves\nattribute classification and enables faithful counterfactual generation, e.g.,\ninterpolating only one specified attribute between two images, validating the\ndisentanglement quality. Codes are in https://github.com/yue-zhongqi/diti.\n","authors":["Zhongqi Yue","Jiankun Wang","Qianru Sun","Lei Ji","Eric I-Chao Chang","Hanwang Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.11430v1.pdf","comment":"Accepted by ICLR 2024"},{"id":"http://arxiv.org/abs/2401.11425v1","updated":"2024-01-21T08:18:45Z","published":"2024-01-21T08:18:45Z","title":"Grayscale Image Colorization with GAN and CycleGAN in Different Image\n Domain","summary":" Automatic colorization of grayscale images has been a challenging task.\nPrevious research has applied supervised methods to this problem [\n1]. In this paper, we reproduce a GAN-based coloring model and experiment with\none of its variants. We also propose a CycleGAN-based model and evaluate\nthese methods on various datasets. The results show that the proposed CycleGAN\nmodel performs well on human-face coloring and comic coloring, but lacks the\nability to produce diverse colorizations.\n","authors":["Chen Liang","Yunchen Sheng","Yichen Mo"],"pdf_url":"https://arxiv.org/pdf/2401.11425v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11421v1","updated":"2024-01-21T07:57:04Z","published":"2024-01-21T07:57:04Z","title":"Enhancing the vision-language foundation model with key semantic\n knowledge-emphasized report refinement","summary":" Recently, vision-language representation learning has made remarkable\nadvancements in building up medical foundation models, holding immense\npotential for transforming the landscape of clinical research and medical care.\nThe underlying hypothesis is that the rich knowledge embedded in radiology\nreports can effectively assist and guide the learning process, reducing the\nneed for additional labels. However, these reports tend to be complex and\nsometimes even consist of redundant descriptions that make the representation\nlearning too challenging to capture the key semantic information. This paper\ndevelops a novel iterative vision-language representation learning framework by\nproposing a key semantic knowledge-emphasized report refinement method.\nParticularly, raw radiology reports are refined to highlight the key\ninformation according to a constructed clinical dictionary and two\nmodel-optimized knowledge-enhancement metrics. The iterative framework is\ndesigned to progressively learn, starting from gaining a general understanding\nof the patient's condition based on raw reports, and gradually refining and\nextracting critical information essential to the fine-grained analysis tasks. 
The\neffectiveness of the proposed framework is validated on various downstream\nmedical image analysis tasks, including disease classification,\nregion-of-interest segmentation, and phrase grounding. Our framework surpasses\nseven state-of-the-art methods in both fine-tuning and zero-shot settings,\ndemonstrating its encouraging potential for different clinical applications.\n","authors":["Cheng Li","Weijian Huang","Hao Yang","Jiarun Liu","Shanshan Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11421v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11420v1","updated":"2024-01-21T07:48:39Z","published":"2024-01-21T07:48:39Z","title":"Embedded Hyperspectral Band Selection with Adaptive Optimization for\n Image Semantic Segmentation","summary":" Hyperspectral band selection plays a pivotal role in remote sensing and image\nanalysis, aiming to identify the most informative spectral bands while\nminimizing computational overhead. In this paper, we introduce a pioneering\napproach for hyperspectral band selection that offers an embedded solution,\nmaking it well-suited for resource-constrained or real-time applications. Our\nproposed method, embedded Hyperspectral Band Selection (EHBS), excels in\nselecting the best bands without the need for prior processing, seamlessly\nintegrating with the downstream task model. This is achieved through the\nadaptation of the Stochastic Gates (STG) algorithm, originally designed for\nfeature selection, for hyperspectral band selection in the context of image\nsemantic segmentation, and the integration of a dynamic optimizer, DoG, which\nremoves the need to tune the learning rate. To assess the\nperformance of our method, we introduce a novel metric for evaluating band\nselection methods across different target numbers of selected bands, quantified\nby the Area Under the Curve (AUC). We conduct experiments on two distinct\nsemantic-segmentation hyperspectral benchmark datasets, demonstrating its\nsuperiority in terms of its resulting accuracy and its ease of use compared to\nmany common and state-of-the-art methods. Furthermore, our contributions extend\nbeyond the realm of hyperspectral band selection. The adaptability of our\napproach to other tasks, especially those involving grouped features, opens up\npromising avenues for broader applications within the realm of deep learning,\nsuch as feature selection for feature groups. The demonstrated success on the\ntested datasets and the potential for application to a variety of tasks\nunderscore the value of our method as a substantial addition to the field of\ncomputer vision.\n","authors":["Yaniv Zimmer","Oren Glickman"],"pdf_url":"https://arxiv.org/pdf/2401.11420v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.17005v3","updated":"2024-01-21T07:36:52Z","published":"2023-11-28T17:59:04Z","title":"MVBench: A Comprehensive Multi-modal Video Understanding Benchmark","summary":" With the rapid development of Multi-modal Large Language Models (MLLMs), a\nnumber of diagnostic benchmarks have recently emerged to evaluate the\ncomprehension capabilities of these models. However, most benchmarks\npredominantly assess spatial understanding in the static image tasks, while\noverlooking temporal understanding in the dynamic video tasks. To alleviate\nthis issue, we introduce a comprehensive Multi-modal Video understanding\nBenchmark, namely MVBench, which covers 20 challenging video tasks that cannot\nbe effectively solved with a single frame. 
Specifically, we first introduce a\nnovel static-to-dynamic method to define these temporal-related tasks. By\ntransforming various static tasks into dynamic ones, we enable the systematic\ngeneration of video tasks that require a broad spectrum of temporal skills,\nranging from perception to cognition. Then, guided by the task definition, we\nautomatically convert public video annotations into multiple-choice QA to\nevaluate each task. On one hand, such a distinct paradigm allows us to build\nMVBench efficiently, without much manual intervention. On the other hand, it\nguarantees evaluation fairness with ground-truth video annotations, avoiding\nthe biased scoring of LLMs. Moreover, we further develop a robust video MLLM\nbaseline, i.e., VideoChat2, by progressive multi-modal training with diverse\ninstruction-tuning data. The extensive results on our MVBench reveal that, the\nexisting MLLMs are far from satisfactory in temporal understanding, while our\nVideoChat2 largely surpasses these leading models by over 15% on MVBench. All\nmodels and data are available at https://github.com/OpenGVLab/Ask-Anything.\n","authors":["Kunchang Li","Yali Wang","Yinan He","Yizhuo Li","Yi Wang","Yi Liu","Zun Wang","Jilan Xu","Guo Chen","Ping Luo","Limin Wang","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2311.17005v3.pdf","comment":"18 pages, 7 figures, 19 tables"},{"id":"http://arxiv.org/abs/2401.09671v2","updated":"2024-01-21T07:27:25Z","published":"2024-01-18T01:07:00Z","title":"Towards Identifiable Unsupervised Domain Translation: A Diversified\n Distribution Matching Approach","summary":" Unsupervised domain translation (UDT) aims to find functions that convert\nsamples from one domain (e.g., sketches) to another domain (e.g., photos)\nwithout changing the high-level semantic meaning (also referred to as\n``content''). The translation functions are often sought by probability\ndistribution matching of the transformed source domain and target domain.\nCycleGAN stands as arguably the most representative approach among this line of\nwork. However, it was noticed in the literature that CycleGAN and variants\ncould fail to identify the desired translation functions and produce\ncontent-misaligned translations. This limitation arises due to the presence of\nmultiple translation functions -- referred to as ``measure-preserving\nautomorphism\" (MPA) -- in the solution space of the learning criteria. Despite\nawareness of such identifiability issues, solutions have remained elusive. This\nstudy delves into the core identifiability inquiry and introduces an MPA\nelimination theory. Our analysis shows that MPA is unlikely to exist, if\nmultiple pairs of diverse cross-domain conditional distributions are matched by\nthe learning function. Our theory leads to a UDT learner using distribution\nmatching over auxiliary variable-induced subsets of the domains -- other than\nover the entire data domains as in the classical approaches. The proposed\nframework is the first to rigorously establish translation identifiability\nunder reasonable UDT settings, to our best knowledge. 
Experiments corroborate\nwith our theoretical claims.\n","authors":["Sagar Shrestha","Xiao Fu"],"pdf_url":"https://arxiv.org/pdf/2401.09671v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11414v1","updated":"2024-01-21T06:47:33Z","published":"2024-01-21T06:47:33Z","title":"S$^3$M-Net: Joint Learning of Semantic Segmentation and Stereo Matching\n for Autonomous Driving","summary":" Semantic segmentation and stereo matching are two essential components of 3D\nenvironmental perception systems for autonomous driving. Nevertheless,\nconventional approaches often address these two problems independently,\nemploying separate models for each task. This approach poses practical\nlimitations in real-world scenarios, particularly when computational resources\nare scarce or real-time performance is imperative. Hence, in this article, we\nintroduce S$^3$M-Net, a novel joint learning framework developed to perform\nsemantic segmentation and stereo matching simultaneously. Specifically,\nS$^3$M-Net shares the features extracted from RGB images between both tasks,\nresulting in an improved overall scene understanding capability. This feature\nsharing process is realized using a feature fusion adaption (FFA) module, which\neffectively transforms the shared features into semantic space and subsequently\nfuses them with the encoded disparity features. The entire joint learning\nframework is trained by minimizing a novel semantic consistency-guided (SCG)\nloss, which places emphasis on the structural consistency in both tasks.\nExtensive experimental results conducted on the vKITTI2 and KITTI datasets\ndemonstrate the effectiveness of our proposed joint learning framework and its\nsuperior performance compared to other state-of-the-art single-task networks.\nOur project webpage is accessible at mias.group/S3M-Net.\n","authors":["Zhiyuan Wu","Yi Feng","Chuang-Wei Liu","Fisher Yu","Qijun Chen","Rui Fan"],"pdf_url":"https://arxiv.org/pdf/2401.11414v1.pdf","comment":"accepted to IEEE Trans. on Intelligent Vehicles (T-IV)"},{"id":"http://arxiv.org/abs/2401.11406v1","updated":"2024-01-21T05:50:39Z","published":"2024-01-21T05:50:39Z","title":"Adversarial Augmentation Training Makes Action Recognition Models More\n Robust to Realistic Video Distribution Shifts","summary":" Despite recent advances in video action recognition achieving strong\nperformance on existing benchmarks, these models often lack robustness when\nfaced with natural distribution shifts between training and test data. We\npropose two novel evaluation methods to assess model resilience to such\ndistribution disparity. One method uses two different datasets collected from\ndifferent sources and uses one for training and validation, and the other for\ntesting. More precisely, we created dataset splits of HMDB-51 or UCF-101 for\ntraining, and Kinetics-400 for testing, using the subset of the classes that\nare overlapping in both train and test datasets. The other proposed method\nextracts the feature mean of each class from the target evaluation dataset's\ntraining data (i.e. class prototype) and estimates test video prediction as a\ncosine similarity score between each sample to the class prototypes of each\ntarget class. This procedure does not alter model weights using the target\ndataset and it does not require aligning overlapping classes of two different\ndatasets, thus is a very efficient method to test the model robustness to\ndistribution shifts without prior knowledge of the target distribution. 
We\naddress the robustness problem by adversarial augmentation training -\ngenerating augmented views of videos that are \"hard\" for the classification\nmodel by applying gradient ascent on the augmentation parameters - as well as\n\"curriculum\" scheduling the strength of the video augmentations. We\nexperimentally demonstrate the superior performance of the proposed adversarial\naugmentation approach over baselines across three state-of-the-art action\nrecognition models - TSM, Video Swin Transformer, and Uniformer. The presented\nwork provides critical insight into model robustness to distribution shifts and\npresents effective techniques to enhance video action recognition performance\nin a real-world deployment.\n","authors":["Kiyoon Kim","Shreyank N Gowda","Panagiotis Eustratiadis","Antreas Antoniou","Robert B Fisher"],"pdf_url":"https://arxiv.org/pdf/2401.11406v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.11737v2","updated":"2024-01-21T04:55:06Z","published":"2023-08-22T18:57:07Z","title":"Animal3D: A Comprehensive Dataset of 3D Animal Pose and Shape","summary":" Accurately estimating the 3D pose and shape is an essential step towards\nunderstanding animal behavior, and can potentially benefit many downstream\napplications, such as wildlife conservation. However, research in this area is\nheld back by the lack of a comprehensive and diverse dataset with high-quality\n3D pose and shape annotations. In this paper, we propose Animal3D, the first\ncomprehensive dataset for mammal animal 3D pose and shape estimation. Animal3D\nconsists of 3379 images collected from 40 mammal species, high-quality\nannotations of 26 keypoints, and importantly the pose and shape parameters of\nthe SMAL model. All annotations were labeled and checked manually in a\nmulti-stage process to ensure highest quality results. Based on the Animal3D\ndataset, we benchmark representative shape and pose estimation models at: (1)\nsupervised learning from only the Animal3D data, (2) synthetic to real transfer\nfrom synthetically generated images, and (3) fine-tuning human pose and shape\nestimation models. Our experimental results demonstrate that predicting the 3D\nshape and pose of animals across species remains a very challenging task,\ndespite significant advances in human pose estimation. Our results further\ndemonstrate that synthetic pre-training is a viable strategy to boost the model\nperformance. Overall, Animal3D opens new directions for facilitating future\nresearch in animal 3D pose and shape estimation, and is publicly available.\n","authors":["Jiacong Xu","Yi Zhang","Jiawei Peng","Wufei Ma","Artur Jesslen","Pengliang Ji","Qixin Hu","Jiehua Zhang","Qihao Liu","Jiahao Wang","Wei Ji","Chen Wang","Xiaoding Yuan","Prakhar Kaushik","Guofeng Zhang","Jie Liu","Yushan Xie","Yawen Cui","Alan Yuille","Adam Kortylewski"],"pdf_url":"https://arxiv.org/pdf/2308.11737v2.pdf","comment":"11 pages, 5 figures, link to the dataset:\n https://xujiacong.github.io/Animal3D/"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2401.11632v1","updated":"2024-01-21T23:56:57Z","published":"2024-01-21T23:56:57Z","title":"What Are We Optimizing For? A Human-centric Evaluation Of Deep\n Learning-based Recommender Systems","summary":" Deep learning-based (DL) models in recommender systems (RecSys) have gained\nsignificant recognition for their remarkable accuracy in predicting user\npreferences. 
However, their performance often lacks a comprehensive evaluation\nfrom a human-centric perspective, which encompasses various dimensions beyond\nsimple interest matching. In this work, we have developed a robust\nhuman-centric evaluation framework that incorporates seven diverse metrics to\nassess the quality of recommendations generated by five recent open-sourced DL\nmodels. Our evaluation datasets consist of both offline benchmark data and\npersonalized online recommendation feedback collected from 445 real users. We\nfind that (1) different DL models have different pros and cons in the\nmulti-dimensional metrics that we test with; (2) users generally want a\ncombination of accuracy with at least one another human values in the\nrecommendation; (3) the degree of combination of different values needs to be\ncarefully experimented to user preferred level.\n","authors":["Ruixuan Sun","Avinash Akella","Xinyi Wu","Ruoyan Kong","Joseph A. Konstan"],"pdf_url":"https://arxiv.org/pdf/2401.11632v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11624v1","updated":"2024-01-21T23:34:42Z","published":"2024-01-21T23:34:42Z","title":"In-context Learning with Retrieved Demonstrations for Language Models: A\n Survey","summary":" Language models, especially pre-trained large language models, have showcased\nremarkable abilities as few-shot in-context learners (ICL), adept at adapting\nto new tasks with just a few demonstrations in the input context. However, the\nmodel's ability to perform ICL is sensitive to the choice of the few-shot\ndemonstrations. Instead of using a fixed set of demonstrations, one recent\ndevelopment is to retrieve demonstrations tailored to each input query. The\nimplementation of demonstration retrieval is relatively straightforward,\nleveraging existing databases and retrieval systems. This not only improves the\nefficiency and scalability of the learning process but also has been shown to\nreduce biases inherent in manual example selection. In light of the encouraging\nresults and growing research in ICL with retrieved demonstrations, we conduct\nan extensive review of studies in this area. In this survey, we discuss and\ncompare different design choices for retrieval models, retrieval training\nprocedures, and inference algorithms.\n","authors":["an Luo","Xin Xu","Yue Liu","Panupong Pasupat","Mehran Kazemi"],"pdf_url":"https://arxiv.org/pdf/2401.11624v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11509v1","updated":"2024-01-21T14:35:54Z","published":"2024-01-21T14:35:54Z","title":"Simple Domain Adaptation for Sparse Retrievers","summary":" In Information Retrieval, and more generally in Natural Language Processing,\nadapting models to specific domains is conducted through fine-tuning. Despite\nthe successes achieved by this method and its versatility, the need for\nhuman-curated and labeled data makes it impractical to transfer to new tasks,\ndomains, and/or languages when training data doesn't exist. Using the model\nwithout training (zero-shot) is another option that however suffers an\neffectiveness cost, especially in the case of first-stage retrievers. Numerous\nresearch directions have emerged to tackle these issues, most of them in the\ncontext of adapting to a task or a language. However, the literature is scarcer\nfor domain (or topic) adaptation. In this paper, we address this issue of\ncross-topic discrepancy for a sparse first-stage retriever by transposing a\nmethod initially designed for language adaptation. 
By leveraging pre-training\non the target data to learn domain-specific knowledge, this technique\nalleviates the need for annotated data and expands the scope of domain\nadaptation. Despite their relatively good generalization ability, we show that\neven sparse retrievers can benefit from our simple domain adaptation method.\n","authors":["Mathias Vast","Yuxuan Zong","Basile Van Cooten","Benjamin Piwowarski","Laure Soulier"],"pdf_url":"https://arxiv.org/pdf/2401.11509v1.pdf","comment":"Accepted at ECIR 2024"},{"id":"http://arxiv.org/abs/2401.11506v1","updated":"2024-01-21T14:33:52Z","published":"2024-01-21T14:33:52Z","title":"Enhancing Recommendation Diversity by Re-ranking with Large Language\n Models","summary":" It has long been recognized that it is not enough for a Recommender System\n(RS) to provide recommendations based only on their relevance to users. Among\nmany other criteria, the set of recommendations may need to be diverse in order\nto handle uncertainty and offer a meaningful choice. The literature reports\nmany ways of measuring diversity and ways of improving the diversity of a set\nof recommendations, most notably by re-ranking and selecting from a larger set\nof candidate recommendations. Driven by promising insights from the literature\non how to incorporate versatile Large Language Models (LLMs) into the RS\npipeline, in this paper, we show how LLMs can be used for diversity re-ranking.\n We begin with an informal study that verifies that LLMs can be used for\nre-ranking tasks and do have some understanding of the concept of diversity.\nThen, we design a more rigorous methodology where LLMs are prompted to generate\na diverse ranking from a candidate ranking using various prompt templates with\ndifferent re-ranking instructions in a zero-shot fashion. We conduct\ncomprehensive experiments testing state-of-the-art conversational LLMs from the\nGPT and Llama families. We compare their re-ranking capabilities with random\nre-ranking and various traditional re-ranking methods from the literature (MMR,\nxQuAD and RxQuAD). We find that LLM-based re-ranking outperforms random\nre-ranking across all the metrics that we use but does not perform as well as\nthe traditional re-ranking methods. We gain insight into prompt design for this\ntask (e.g.\\ on the whole, it is better to prompt for diversity rather than a\nbalance of diversity and relevance). Given that no special knowledge\nengineering is needed, we conclude that LLM-based re-ranking is a promising\napproach, and we highlight directions for future research. We open-source the\ncode of our experiments for reproducibility.\n","authors":["Diego Carraro","Derek Bridge"],"pdf_url":"https://arxiv.org/pdf/2401.11506v1.pdf","comment":"32 pages, 2 figures"},{"id":"http://arxiv.org/abs/2401.11505v1","updated":"2024-01-21T14:30:20Z","published":"2024-01-21T14:30:20Z","title":"CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray\n Report Labeling","summary":" Free-text radiology reports present a rich data source for various medical\ntasks, but effectively labeling these texts remains challenging. Traditional\nrule-based labeling methods fall short of capturing the nuances of diverse\nfree-text patterns. Moreover, models using expert-annotated data are limited by\ndata scarcity and pre-defined classes, impacting their performance, flexibility\nand scalability. 
To address these issues, our study offers three main\ncontributions: 1) We demonstrate the potential of GPT as an adept labeler using\ncarefully designed prompts. 2) Utilizing only the data labeled by GPT, we\ntrained a BERT-based labeler, CheX-GPT, which operates faster and more\nefficiently than its GPT counterpart. 3) To benchmark labeler performance, we\nintroduced a publicly available expert-annotated test set, MIMIC-500,\ncomprising 500 cases from the MIMIC validation set. Our findings demonstrate\nthat CheX-GPT not only excels in labeling accuracy over existing models, but\nalso showcases superior efficiency, flexibility, and scalability, supported by\nour introduction of the MIMIC-500 dataset for robust benchmarking. Code and\nmodels are available at https://github.com/kakaobrain/CheXGPT.\n","authors":["Jawook Gu","Han-Cheol Cho","Jiho Kim","Kihyun You","Eun Kyoung Hong","Byungseok Roh"],"pdf_url":"https://arxiv.org/pdf/2401.11505v1.pdf","comment":"16 pages, 3 figures"},{"id":"http://arxiv.org/abs/2401.11478v1","updated":"2024-01-21T12:51:28Z","published":"2024-01-21T12:51:28Z","title":"D2K: Turning Historical Data into Retrievable Knowledge for Recommender\n Systems","summary":" A vast amount of user behavior data is constantly accumulating on today's\nlarge recommendation platforms, recording users' various interests and tastes.\nPreserving knowledge from the old data while new data continually arrives is a\nvital problem for recommender systems. Existing approaches generally seek to\nsave the knowledge implicitly in the model parameters. However, such a\nparameter-centric approach lacks scalability and flexibility -- the capacity is\nhard to scale, and the knowledge is inflexible to utilize. Hence, in this work,\nwe propose a framework that turns massive user behavior data to retrievable\nknowledge (D2K). It is a data-centric approach that is model-agnostic and easy\nto scale up. Different from only storing unary knowledge such as the user-side\nor item-side information, D2K propose to store ternary knowledge for\nrecommendation, which is determined by the complete recommendation factors --\nuser, item, and context. The knowledge retrieved by target samples can be\ndirectly used to enhance the performance of any recommendation algorithms.\nSpecifically, we introduce a Transformer-based knowledge encoder to transform\nthe old data into knowledge with the user-item-context cross features. A\npersonalized knowledge adaptation unit is devised to effectively exploit the\ninformation from the knowledge base by adapting the retrieved knowledge to the\ntarget samples. Extensive experiments on two public datasets show that D2K\nsignificantly outperforms existing baselines and is compatible with a major\ncollection of recommendation algorithms.\n","authors":["Jiarui Qin","Weiwen Liu","Ruiming Tang","Weinan Zhang","Yong Yu"],"pdf_url":"https://arxiv.org/pdf/2401.11478v1.pdf","comment":"12 pages, 7 figures"},{"id":"http://arxiv.org/abs/2401.11463v1","updated":"2024-01-21T11:04:30Z","published":"2024-01-21T11:04:30Z","title":"Estimating the Usefulness of Clarifying Questions and Answers for\n Conversational Search","summary":" While the body of research directed towards constructing and generating\nclarifying questions in mixed-initiative conversational search systems is vast,\nresearch aimed at processing and comprehending users' answers to such questions\nis scarce. 
To this end, we present a simple yet effective method for processing\nanswers to clarifying questions, moving away from previous work that simply\nappends answers to the original query and thus potentially degrades retrieval\nperformance. Specifically, we propose a classifier for assessing usefulness of\nthe prompted clarifying question and an answer given by the user. Useful\nquestions or answers are further appended to the conversation history and\npassed to a transformer-based query rewriting module. Results demonstrate\nsignificant improvements over strong non-mixed-initiative baselines.\nFurthermore, the proposed approach mitigates the performance drops when non\nuseful questions and answers are utilized.\n","authors":["Ivan Sekulić","Weronika Łajewska","Krisztian Balog","Fabio Crestani"],"pdf_url":"https://arxiv.org/pdf/2401.11463v1.pdf","comment":"This is the author's version of the work. The definitive version is\n published in: Proceedings of the 46th European Conference on Information\n Retrieval (ECIR '24), March 24-28, 2024, Glasgow, Scotland"},{"id":"http://arxiv.org/abs/2401.11452v1","updated":"2024-01-21T10:15:36Z","published":"2024-01-21T10:15:36Z","title":"Towards Reliable and Factual Response Generation: Detecting Unanswerable\n Questions in Information-Seeking Conversations","summary":" Generative AI models face the challenge of hallucinations that can undermine\nusers' trust in such systems. We approach the problem of conversational\ninformation seeking as a two-step process, where relevant passages in a corpus\nare identified first and then summarized into a final system response. This way\nwe can automatically assess if the answer to the user's question is present in\nthe corpus. Specifically, our proposed method employs a sentence-level\nclassifier to detect if the answer is present, then aggregates these\npredictions on the passage level, and eventually across the top-ranked passages\nto arrive at a final answerability estimate. For training and evaluation, we\ndevelop a dataset based on the TREC CAsT benchmark that includes answerability\nlabels on the sentence, passage, and ranking levels. We demonstrate that our\nproposed method represents a strong baseline and outperforms a state-of-the-art\nLLM on the answerability prediction task.\n","authors":["Weronika Łajewska","Krisztian Balog"],"pdf_url":"https://arxiv.org/pdf/2401.11452v1.pdf","comment":"This is the author's version of the work. The definitive version is\n published in: Proceedings of the 46th European Conference on Information\n Retrieval} (ECIR '24), March 24--28, 2024, Glasgow, Scotland"},{"id":"http://arxiv.org/abs/2401.11441v1","updated":"2024-01-21T09:42:24Z","published":"2024-01-21T09:42:24Z","title":"On-Device Recommender Systems: A Comprehensive Survey","summary":" Recommender systems have been widely deployed in various real-world\napplications to help users identify content of interest from massive amounts of\ninformation. Traditional recommender systems work by collecting user-item\ninteraction data in a cloud-based data center and training a centralized model\nto perform the recommendation service. However, such cloud-based recommender\nsystems (CloudRSs) inevitably suffer from excessive resource consumption,\nresponse latency, as well as privacy and security risks concerning both data\nand models. 
Recently, driven by the advances in storage, communication, and\ncomputation capabilities of edge devices, there has been a shift of focus from\nCloudRSs to on-device recommender systems (DeviceRSs), which leverage the\ncapabilities of edge devices to minimize centralized data storage requirements,\nreduce the response latency caused by communication overheads, and enhance user\nprivacy and security by localizing data processing and model training. Despite\nthe rapid rise of DeviceRSs, there is a clear absence of timely literature\nreviews that systematically introduce, categorize and contrast these methods.\nTo bridge this gap, we aim to provide a comprehensive survey of DeviceRSs,\ncovering three main aspects: (1) the deployment and inference of DeviceRSs (2)\nthe training and update of DeviceRSs (3) the security and privacy of DeviceRSs.\nFurthermore, we provide a fine-grained and systematic taxonomy of the methods\ninvolved in each aspect, followed by a discussion regarding challenges and\nfuture research directions. This is the first comprehensive survey on DeviceRSs\nthat covers a spectrum of tasks to fit various needs. We believe this survey\nwill help readers effectively grasp the current research status in this field,\nequip them with relevant technical foundations, and stimulate new research\nideas for developing DeviceRSs.\n","authors":["Hongzhi Yin","Liang Qu","Tong Chen","Wei Yuan","Ruiqi Zheng","Jing Long","Xin Xia","Yuhui Shi","Chengqi Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.11441v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2401.11632v1","updated":"2024-01-21T23:56:57Z","published":"2024-01-21T23:56:57Z","title":"What Are We Optimizing For? A Human-centric Evaluation Of Deep\n Learning-based Recommender Systems","summary":" Deep learning-based (DL) models in recommender systems (RecSys) have gained\nsignificant recognition for their remarkable accuracy in predicting user\npreferences. However, their performance often lacks a comprehensive evaluation\nfrom a human-centric perspective, which encompasses various dimensions beyond\nsimple interest matching. In this work, we have developed a robust\nhuman-centric evaluation framework that incorporates seven diverse metrics to\nassess the quality of recommendations generated by five recent open-sourced DL\nmodels. Our evaluation datasets consist of both offline benchmark data and\npersonalized online recommendation feedback collected from 445 real users. We\nfind that (1) different DL models have different pros and cons in the\nmulti-dimensional metrics that we test with; (2) users generally want a\ncombination of accuracy with at least one another human values in the\nrecommendation; (3) the degree of combination of different values needs to be\ncarefully experimented to user preferred level.\n","authors":["Ruixuan Sun","Avinash Akella","Xinyi Wu","Ruoyan Kong","Joseph A. Konstan"],"pdf_url":"https://arxiv.org/pdf/2401.11632v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11631v1","updated":"2024-01-21T23:54:05Z","published":"2024-01-21T23:54:05Z","title":"Text-to-Image Cross-Modal Generation: A Systematic Review","summary":" We review research on generating visual data from text from the angle of\n\"cross-modal generation.\" This point of view allows us to draw parallels\nbetween various methods geared towards working on input text and producing\nvisual output, without limiting the analysis to narrow sub-areas. 
It also\nresults in the identification of common templates in the field, which are then\ncompared and contrasted both within pools of similar methods and across lines\nof research. We provide a breakdown of text-to-image generation into various\nflavors of image-from-text methods, video-from-text methods, image editing,\nself-supervised and graph-based approaches. In this discussion, we focus on\nresearch papers published at 8 leading machine learning conferences in the\nyears 2016-2022, also incorporating a number of relevant papers not matching\nthe outlined search criteria. The conducted review suggests a significant\nincrease in the number of papers published in the area and highlights research\ngaps and potential lines of investigation. To our knowledge, this is the first\nreview to systematically look at text-to-image generation from the perspective\nof \"cross-modal generation.\"\n","authors":["Maciej Żelaszczyk","Jacek Mańdziuk"],"pdf_url":"https://arxiv.org/pdf/2401.11631v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11630v1","updated":"2024-01-21T23:50:46Z","published":"2024-01-21T23:50:46Z","title":"Reframing Offline Reinforcement Learning as a Regression Problem","summary":" The study proposes the reformulation of offline reinforcement learning as a\nregression problem that can be solved with decision trees. Aiming to predict\nactions based on input states, return-to-go (RTG), and timestep information, we\nobserve that with gradient-boosted trees, the agent training and inference are\nvery fast, the former taking less than a minute. Despite the simplification\ninherent in this reformulated problem, our agent demonstrates performance that\nis at least on par with established methods. This assertion is validated by\ntesting it across standard datasets associated with D4RL Gym-MuJoCo tasks. We\nfurther discuss the agent's ability to generalize by testing it on two extreme\ncases, how it learns to model the return distributions effectively even with\nhighly skewed expert datasets, and how it exhibits robust performance in\nscenarios with sparse/delayed rewards.\n","authors":["Prajwal Koirala","Cody Fleming"],"pdf_url":"https://arxiv.org/pdf/2401.11630v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11627v1","updated":"2024-01-21T23:41:32Z","published":"2024-01-21T23:41:32Z","title":"Tight Verification of Probabilistic Robustness in Bayesian Neural\n Networks","summary":" We introduce two algorithms for computing tight guarantees on the\nprobabilistic robustness of Bayesian Neural Networks (BNNs). Computing\nrobustness guarantees for BNNs is a significantly more challenging task than\nverifying the robustness of standard Neural Networks (NNs) because it requires\nsearching the parameters' space for safe weights. Moreover, tight and complete\napproaches for the verification of standard NNs, such as those based on\nMixed-Integer Linear Programming (MILP), cannot be directly used for the\nverification of BNNs because of the polynomial terms resulting from the\nconsecutive multiplication of variables encoding the weights. Our algorithms\nefficiently and effectively search the parameters' space for safe weights by\nusing iterative expansion and the network's gradient and can be used with any\nverification algorithm of choice for BNNs. 
In addition to proving that our\nalgorithms compute tighter bounds than the SoA, we also evaluate our algorithms\nagainst the SoA on standard benchmarks, such as MNIST and CIFAR10, showing that\nour algorithms compute bounds up to 40% tighter than the SoA.\n","authors":["Ben Batten","Mehran Hosseini","Alessio Lomuscio"],"pdf_url":"https://arxiv.org/pdf/2401.11627v1.pdf","comment":"Accepted at AISTATS 2024"},{"id":"http://arxiv.org/abs/2401.11626v1","updated":"2024-01-21T23:37:33Z","published":"2024-01-21T23:37:33Z","title":"Freely Long-Thinking Transformer (FraiLT)","summary":" Freely Long-Thinking Transformer (FraiLT) is an improved transformer model\ndesigned to enhance processing capabilities without scaling up size. It\nutilizes a recursive approach, iterating over a subset of layers multiple\ntimes, and introduces iteration encodings to maintain awareness across these\ncycles. Iteration encoding allows FraiLT to achieve the interpretive depth of\nlarger models in a compact form. When evaluated on a synthetic story dataset,\nFraiLT outperformed larger models, showcasing its ability to deliver\nhigh-quality performance while reducing memory demands. This model represents a\nstep forward towards more efficient and accessible language models.\n","authors":["Akbay Tabak"],"pdf_url":"https://arxiv.org/pdf/2401.11626v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11618v1","updated":"2024-01-21T22:55:26Z","published":"2024-01-21T22:55:26Z","title":"Efficient local linearity regularization to overcome catastrophic\n overfitting","summary":" Catastrophic overfitting (CO) in single-step adversarial training (AT)\nresults in abrupt drops in the adversarial test accuracy (even down to 0%). For\nmodels trained with multi-step AT, it has been observed that the loss function\nbehaves locally linearly with respect to the input, this is however lost in\nsingle-step AT. To address CO in single-step AT, several methods have been\nproposed to enforce local linearity of the loss via regularization. However,\nthese regularization terms considerably slow down training due to Double\nBackpropagation. Instead, in this work, we introduce a regularization term,\ncalled ELLE, to mitigate CO effectively and efficiently in classical AT\nevaluations, as well as some more difficult regimes, e.g., large adversarial\nperturbations and long training schedules. Our regularization term can be\ntheoretically linked to curvature of the loss function and is computationally\ncheaper than previous methods by avoiding Double Backpropagation. Our thorough\nexperimental validation demonstrates that our work does not suffer from CO,\neven in challenging settings where previous works suffer from it. We also\nnotice that adapting our regularization parameter during training (ELLE-A)\ngreatly improves the performance, specially in large $\\epsilon$ setups. Our\nimplementation is available in https://github.com/LIONS-EPFL/ELLE .\n","authors":["Elias Abad Rocamora","Fanghui Liu","Grigorios G. Chrysos","Pablo M. Olmos","Volkan Cevher"],"pdf_url":"https://arxiv.org/pdf/2401.11618v1.pdf","comment":"Accepted in ICLR 2024"},{"id":"http://arxiv.org/abs/2310.19491v2","updated":"2024-01-21T22:35:34Z","published":"2023-10-30T12:28:53Z","title":"Generator Identification for Linear SDEs with Additive and\n Multiplicative Noise","summary":" In this paper, we present conditions for identifying the generator of a\nlinear stochastic differential equation (SDE) from the distribution of its\nsolution process with a given fixed initial state. 
These identifiability\nconditions are crucial in causal inference using linear SDEs as they enable the\nidentification of the post-intervention distributions from its observational\ndistribution. Specifically, we derive a sufficient and necessary condition for\nidentifying the generator of linear SDEs with additive noise, as well as a\nsufficient condition for identifying the generator of linear SDEs with\nmultiplicative noise. We show that the conditions derived for both types of\nSDEs are generic. Moreover, we offer geometric interpretations of the derived\nidentifiability conditions to enhance their understanding. To validate our\ntheoretical results, we perform a series of simulations, which support and\nsubstantiate the established findings.\n","authors":["Yuanyuan Wang","Xi Geng","Wei Huang","Biwei Huang","Mingming Gong"],"pdf_url":"https://arxiv.org/pdf/2310.19491v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11611v1","updated":"2024-01-21T22:18:29Z","published":"2024-01-21T22:18:29Z","title":"Continuous Field Reconstruction from Sparse Observations with Implicit\n Neural Networks","summary":" Reliably reconstructing physical fields from sparse sensor data is a\nchallenge that frequently arises in many scientific domains. In practice, the\nprocess generating the data often is not understood to sufficient accuracy.\nTherefore, there is a growing interest in using the deep neural network route\nto address the problem. This work presents a novel approach that learns a\ncontinuous representation of the physical field using implicit neural\nrepresentations (INRs). Specifically, after factorizing spatiotemporal\nvariability into spatial and temporal components using the separation of\nvariables technique, the method learns relevant basis functions from sparsely\nsampled irregular data points to develop a continuous representation of the\ndata. In experimental evaluations, the proposed model outperforms recent INR\nmethods, offering superior reconstruction quality on simulation data from a\nstate-of-the-art climate model and a second dataset that comprises ultra-high\nresolution satellite-based sea surface temperature fields.\n","authors":["Xihaier Luo","Wei Xu","Yihui Ren","Shinjae Yoo","Balu Nadiga"],"pdf_url":"https://arxiv.org/pdf/2401.11611v1.pdf","comment":"25 pages,21 figures"},{"id":"http://arxiv.org/abs/2401.11609v1","updated":"2024-01-21T22:11:29Z","published":"2024-01-21T22:11:29Z","title":"Graph Edits for Counterfactual Explanations: A Unified GNN Approach","summary":" Counterfactuals have been established as a popular explainability technique\nwhich leverages a set of minimal edits to alter the prediction of a classifier.\nWhen considering conceptual counterfactuals, the edits requested should\ncorrespond to salient concepts present in the input data. At the same time,\nconceptual distances are defined by knowledge graphs, ensuring the optimality\nof conceptual edits. 
In this work, we extend previous endeavors on conceptual\ncounterfactuals by introducing \\textit{graph edits as counterfactual\nexplanations}: should we represent input data as graphs, which is the shortest\ngraph edit path that results in an alternative classification label as provided\nby a black-box classifier?\n","authors":["Nikolaos Chaidos","Angeliki Dimitriou","Maria Lymperaiou","Giorgos Stamou"],"pdf_url":"https://arxiv.org/pdf/2401.11609v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.07364v2","updated":"2024-01-21T22:08:20Z","published":"2024-01-14T20:41:36Z","title":"PDE Generalization of In-Context Operator Networks: A Study on 1D Scalar\n Nonlinear Conservation Laws","summary":" Can we build a single large model for a wide range of PDE-related scientific\nlearning tasks? Can this model generalize to new PDEs, even of new forms,\nwithout any fine-tuning? In-context operator learning and the corresponding\nmodel In-Context Operator Networks (ICON) represent an initial exploration of\nthese questions. The capability of ICON regarding the first question has been\ndemonstrated previously. In this paper, we present a detailed methodology for\nsolving PDE problems with ICON, and show how a single ICON model can make\nforward and reverse predictions for different equations with different strides,\nprovided with appropriately designed data prompts. We show the positive\nevidence to the second question, i.e., ICON can generalize well to some PDEs\nwith new forms without any fine-tuning. This is exemplified through a study on\n1D scalar nonlinear conservation laws, a family of PDEs with temporal\nevolution. We also show how to broaden the range of problems that an ICON model\ncan address, by transforming functions and equations to ICON's capability\nscope. We believe that the progress in this paper is a significant step towards\nthe goal of training a foundation model for PDE-related tasks under the\nin-context operator learning framework.\n","authors":["Liu Yang","Stanley J. Osher"],"pdf_url":"https://arxiv.org/pdf/2401.07364v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11608v1","updated":"2024-01-21T22:01:34Z","published":"2024-01-21T22:01:34Z","title":"$\\texttt{immrax}$: A Parallelizable and Differentiable Toolbox for\n Interval Analysis and Mixed Monotone Reachability in JAX","summary":" We present an implementation of interval analysis and mixed monotone interval\nreachability analysis as function transforms in Python, fully composable with\nthe computational framework JAX. The resulting toolbox inherits several key\nfeatures from JAX, including computational efficiency through Just-In-Time\nCompilation, GPU acceleration for quick parallelized computations, and\nAutomatic Differentiability. We demonstrate the toolbox's performance on\nseveral case studies, including a reachability problem on a vehicle model\ncontrolled by a neural network, and a robust closed-loop optimal control\nproblem for a swinging pendulum.\n","authors":["Akash Harapanahalli","Saber Jafarpour","Samuel Coogan"],"pdf_url":"https://arxiv.org/pdf/2401.11608v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11605v1","updated":"2024-01-21T21:49:49Z","published":"2024-01-21T21:49:49Z","title":"Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass\n Diffusion Transformers","summary":" We present the Hourglass Diffusion Transformer (HDiT), an image generative\nmodel that exhibits linear scaling with pixel count, supporting training at\nhigh-resolution (e.g. 
$1024 \\times 1024$) directly in pixel-space. Building on\nthe Transformer architecture, which is known to scale to billions of\nparameters, it bridges the gap between the efficiency of convolutional U-Nets\nand the scalability of Transformers. HDiT trains successfully without typical\nhigh-resolution training techniques such as multiscale architectures, latent\nautoencoders or self-conditioning. We demonstrate that HDiT performs\ncompetitively with existing models on ImageNet $256^2$, and sets a new\nstate-of-the-art for diffusion models on FFHQ-$1024^2$.\n","authors":["Katherine Crowson","Stefan Andreas Baumann","Alex Birch","Tanishq Mathew Abraham","Daniel Z. Kaplan","Enrico Shippole"],"pdf_url":"https://arxiv.org/pdf/2401.11605v1.pdf","comment":"20 pages, 13 figures, project page and code available at\n https://crowsonkb.github.io/hourglass-diffusion-transformers/"},{"id":"http://arxiv.org/abs/2312.02063v2","updated":"2024-01-21T21:41:32Z","published":"2023-12-04T17:19:37Z","title":"The GPU Phase Folding and Deep Learning Method for Detecting Exoplanet\n Transits","summary":" This paper presents GPFC, a novel Graphics Processing Unit (GPU) Phase\nFolding and Convolutional Neural Network (CNN) system to detect exoplanets\nusing the transit method. We devise a fast folding algorithm parallelized on a\nGPU to amplify low signal-to-noise ratio transit signals, allowing a search at\nhigh precision and speed. A CNN trained on two million synthetic light curves\nreports a score indicating the likelihood of a planetary signal at each period.\nWhile the GPFC method has broad applicability across period ranges, this\nresearch specifically focuses on detecting ultra-short-period planets with\norbital periods less than one day. GPFC improves on speed by three orders of\nmagnitude over the predominant Box-fitting Least Squares (BLS) method. Our\nsimulation results show GPFC achieves $97%$ training accuracy, higher true\npositive rate at the same false positive rate of detection, and higher\nprecision at the same recall rate when compared to BLS. GPFC recovers $100\\%$\nof known ultra-short-period planets in $\\textit{Kepler}$ light curves from a\nblind search. These results highlight the promise of GPFC as an alternative\napproach to the traditional BLS algorithm for finding new transiting exoplanets\nin data taken with $\\textit{Kepler}$ and other space transit missions such as\nK2, TESS and future PLATO and Earth 2.0.\n","authors":["Kaitlyn Wang","Jian Ge","Kevin Willis","Kevin Wang","Yinan Zhao"],"pdf_url":"https://arxiv.org/pdf/2312.02063v2.pdf","comment":"16 pages, 19 figures; Accepted for publication in the peer-reviewed\n journal, Monthly Notices of the Royal Astronomical Society (MNRAS), on\n January 20, 2024"}]},"2024-01-20T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2305.14189v3","updated":"2024-01-20T22:29:15Z","published":"2023-05-23T16:11:00Z","title":"Beyond Shared Vocabulary: Increasing Representational Word Similarities\n across Languages for Multilingual Machine Translation","summary":" Using a vocabulary that is shared across languages is common practice in\nMultilingual Neural Machine Translation (MNMT). In addition to its simple\ndesign, shared tokens play an important role in positive knowledge transfer,\nassuming that shared tokens refer to similar meanings across languages.\nHowever, when word overlap is small, especially due to different writing\nsystems, transfer is inhibited. 
In this paper, we define word-level information\ntransfer pathways via word equivalence classes and rely on graph networks to\nfuse word embeddings across languages. Our experiments demonstrate the\nadvantages of our approach: 1) embeddings of words with similar meanings are\nbetter aligned across languages, 2) our method achieves consistent BLEU\nimprovements of up to 2.3 points for high- and low-resource MNMT, and 3) less\nthan 1.0\\% additional trainable parameters are required with a limited increase\nin computational costs, while inference time remains identical to the baseline.\nWe release the codebase to the community.\n","authors":["Di Wu","Christof Monz"],"pdf_url":"https://arxiv.org/pdf/2305.14189v3.pdf","comment":"15 pages, 3 figures"},{"id":"http://arxiv.org/abs/2401.07510v3","updated":"2024-01-20T22:08:18Z","published":"2024-01-15T07:21:16Z","title":"Developing ChatGPT for Biology and Medicine: A Complete Review of\n Biomedical Question Answering","summary":" ChatGPT explores a strategic blueprint of question answering (QA) in\ndelivering medical diagnosis, treatment recommendations, and other healthcare\nsupport. This is achieved through the increasing incorporation of medical\ndomain data via natural language processing (NLP) and multimodal paradigms. By\ntransitioning the distribution of text, images, videos, and other modalities\nfrom the general domain to the medical domain, these techniques have expedited\nthe progress of medical domain question answering (MDQA). They bridge the gap\nbetween human natural language and sophisticated medical domain knowledge or\nexpert manual annotations, handling large-scale, diverse, unbalanced, or even\nunlabeled data analysis scenarios in medical contexts. Central to our focus is\nthe utilizing of language models and multimodal paradigms for medical question\nanswering, aiming to guide the research community in selecting appropriate\nmechanisms for their specific medical research requirements. Specialized tasks\nsuch as unimodal-related question answering, reading comprehension, reasoning,\ndiagnosis, relation extraction, probability modeling, and others, as well as\nmultimodal-related tasks like vision question answering, image caption,\ncross-modal retrieval, report summarization, and generation, are discussed in\ndetail. Each section delves into the intricate specifics of the respective\nmethod under consideration. This paper highlights the structures and\nadvancements of medical domain explorations against general domain methods,\nemphasizing their applications across different tasks and datasets. It also\noutlines current challenges and opportunities for future medical domain\nresearch, paving the way for continued innovation and application in this\nrapidly evolving field.\n","authors":["Qing Li","Lei Li","Yu Li"],"pdf_url":"https://arxiv.org/pdf/2401.07510v3.pdf","comment":"50 pages, 3 figures, 3 tables"},{"id":"http://arxiv.org/abs/2312.02317v3","updated":"2024-01-20T21:16:09Z","published":"2023-12-04T19:58:07Z","title":"GNN2R: Weakly-Supervised Rationale-Providing Question Answering over\n Knowledge Graphs","summary":" Most current methods for multi-hop question answering (QA) over knowledge\ngraphs (KGs) only provide final conclusive answers without explanations, such\nas a set of KG entities that is difficult for normal users to review and\ncomprehend. This issue severely limits the application of KG-based QA in\nreal-world scenarios. 
However, it is non-trivial to solve due to two\nchallenges: First, annotations of reasoning chains of multi-hop questions,\nwhich could serve as supervision for explanation generation, are usually\nlacking. Second, it is difficult to maintain high efficiency when explicit KG\ntriples need to be retrieved to generate explanations. In this paper, we\npropose a novel Graph Neural Network-based Two-Step Reasoning model (GNN2R) to\nsolve this issue. GNN2R can provide both final answers and reasoning subgraphs\nas a rationale behind final answers efficiently with only weak supervision that\nis available through question-final answer pairs. We extensively evaluated\nGNN2R with detailed analyses in experiments. The results demonstrate that, in\nterms of effectiveness, efficiency, and quality of generated explanations,\nGNN2R outperforms existing state-of-the-art methods that are applicable to this\ntask. Our code and pre-trained models are available at\nhttps://github.com/ruijie-wang-uzh/GNN2R.\n","authors":["Ruijie Wang","Luca Rossetto","Michael Cochez","Abraham Bernstein"],"pdf_url":"https://arxiv.org/pdf/2312.02317v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11323v1","updated":"2024-01-20T20:55:21Z","published":"2024-01-20T20:55:21Z","title":"Analyzing Task-Encoding Tokens in Large Language Models","summary":" In-context learning (ICL) has become an effective solution for few-shot\nlearning in natural language processing. Past work has found that, during this\nprocess, representations of the last prompt token are utilized to store task\nreasoning procedures, thereby explaining the working mechanism of in-context\nlearning. In this paper, we seek to locate and analyze other task-encoding\ntokens whose representations store task reasoning procedures. Supported by\nexperiments that ablate the representations of different token types, we find\nthat template and stopword tokens are the most prone to be task-encoding\ntokens. In addition, we demonstrate experimentally that lexical cues,\nrepetition, and text formats are the main distinguishing characteristics of\nthese tokens. Our work provides additional insights into how large language\nmodels (LLMs) leverage task reasoning procedures in ICL and suggests that\nfuture work may involve using task-encoding tokens to improve the computational\nefficiency of LLMs at inference time and their ability to handle long\nsequences.\n","authors":["Yu Bai","Heyan Huang","Cesare Spinoso-Di Piano","Marc-Antoine Rondeau","Sanxing Chen","Yang Gao","Jackie Chi Kit Cheung"],"pdf_url":"https://arxiv.org/pdf/2401.11323v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2401.11316v1","updated":"2024-01-20T20:25:17Z","published":"2024-01-20T20:25:17Z","title":"PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation","summary":" With the proliferation of large pre-trained language models (PLMs),\nfine-tuning all model parameters becomes increasingly inefficient, particularly\nwhen dealing with numerous downstream tasks that entail substantial training\nand storage costs. Several approaches aimed at achieving parameter-efficient\nfine-tuning (PEFT) have been proposed. Among them, Low-Rank Adaptation (LoRA)\nstands out as an archetypal method, incorporating trainable rank decomposition\nmatrices into each target module. Nevertheless, LoRA does not consider the\nvarying importance of each layer. 
To address these challenges, we introduce\nPRILoRA, which linearly allocates a different rank for each layer, in an\nincreasing manner, and performs pruning throughout the training process,\nconsidering both the temporary magnitude of weights and the accumulated\nstatistics of the input to any given layer. We validate the effectiveness of\nPRILoRA through extensive experiments on eight GLUE benchmarks, setting a new\nstate of the art.\n","authors":["Nadav Benedek","Lior Wolf"],"pdf_url":"https://arxiv.org/pdf/2401.11316v1.pdf","comment":"EACL 2024"},{"id":"http://arxiv.org/abs/2401.11305v1","updated":"2024-01-20T19:32:56Z","published":"2024-01-20T19:32:56Z","title":"Progress in Privacy Protection: A Review of Privacy Preserving\n Techniques in Recommender Systems, Edge Computing, and Cloud Computing","summary":" As digital technology evolves, the increasing use of connected devices brings\nboth challenges and opportunities in the areas of mobile crowdsourcing, edge\ncomputing, and recommender systems. This survey focuses on these dynamic\nfields, emphasizing the critical need for privacy protection in our\nincreasingly data-oriented world. It explores the latest trends in these\ninterconnected areas, with a special emphasis on privacy and data security. Our\nmethod involves an in-depth analysis of various academic works, which helps us\nto gain a comprehensive understanding of these sectors and their shifting focus\ntowards privacy concerns. We present new insights and marks a significant\nadvancement in addressing privacy issues within these technologies. The survey\nis a valuable resource for researchers, industry practitioners, and policy\nmakers, offering an extensive overview of these fields and their related\nprivacy challenges, catering to a wide audience in the modern digital era.\n","authors":["Syed Raza Bashir","Shaina Raza","Vojislav Misic"],"pdf_url":"https://arxiv.org/pdf/2401.11305v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.04925v3","updated":"2024-01-20T17:23:31Z","published":"2024-01-10T04:37:38Z","title":"The Impact of Reasoning Step Length on Large Language Models","summary":" Chain of Thought (CoT) is significant in improving the reasoning abilities of\nlarge language models (LLMs). However, the correlation between the\neffectiveness of CoT and the length of reasoning steps in prompts remains\nlargely unknown. To shed light on this, we have conducted several empirical\nexperiments to explore the relations. Specifically, we design experiments that\nexpand and compress the rationale reasoning steps within CoT demonstrations,\nwhile keeping all other factors constant. We have the following key findings.\nFirst, the results indicate that lengthening the reasoning steps in prompts,\neven without adding new information into the prompt, considerably enhances\nLLMs' reasoning abilities across multiple datasets. Alternatively, shortening\nthe reasoning steps, even while preserving the key information, significantly\ndiminishes the reasoning abilities of models. This finding highlights the\nimportance of the number of steps in CoT prompts and provides practical\nguidance to make better use of LLMs' potential in complex problem-solving\nscenarios. Second, we also investigated the relationship between the\nperformance of CoT and the rationales used in demonstrations. Surprisingly, the\nresult shows that even incorrect rationales can yield favorable outcomes if\nthey maintain the requisite length of inference. 
Third, we observed that the\nadvantages of increasing reasoning steps are task-dependent: simpler tasks\nrequire fewer steps, whereas complex tasks gain significantly from longer\ninference sequences.\n","authors":["Mingyu Jin","Qinkai Yu","Dong Shu","Haiyan Zhao","Wenyue Hua","Yanda Meng","Yongfeng Zhang","Mengnan Du"],"pdf_url":"https://arxiv.org/pdf/2401.04925v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11268v1","updated":"2024-01-20T16:48:55Z","published":"2024-01-20T16:48:55Z","title":"Word-Level ASR Quality Estimation for Efficient Corpus Sampling and\n Post-Editing through Analyzing Attentions of a Reference-Free Metric","summary":" In the realm of automatic speech recognition (ASR), the quest for models that\nnot only perform with high accuracy but also offer transparency in their\ndecision-making processes is crucial. The potential of quality estimation (QE)\nmetrics is introduced and evaluated as a novel tool to enhance explainable\nartificial intelligence (XAI) in ASR systems. Through experiments and analyses,\nthe capabilities of the NoRefER (No Reference Error Rate) metric are explored\nin identifying word-level errors to aid post-editors in refining ASR\nhypotheses. The investigation also extends to the utility of NoRefER in the\ncorpus-building process, demonstrating its effectiveness in augmenting datasets\nwith insightful annotations. The diagnostic aspects of NoRefER are examined,\nrevealing its ability to provide valuable insights into model behaviors and\ndecision patterns. This has proven beneficial for prioritizing hypotheses in\npost-editing workflows and fine-tuning ASR models. The findings suggest that\nNoRefER is not merely a tool for error detection but also a comprehensive\nframework for enhancing ASR systems' transparency, efficiency, and\neffectiveness. To ensure the reproducibility of the results, all source codes\nof this study are made publicly available.\n","authors":["Golara Javadi","Kamer Ali Yuksel","Yunsu Kim","Thiago Castro Ferreira","Mohamed Al-Badrashiny"],"pdf_url":"https://arxiv.org/pdf/2401.11268v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11248v1","updated":"2024-01-20T15:02:33Z","published":"2024-01-20T15:02:33Z","title":"Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense\n Passage Retrieval","summary":" Masked auto-encoder pre-training has emerged as a prevalent technique for\ninitializing and enhancing dense retrieval systems. It generally utilizes\nadditional Transformer decoder blocks to provide sustainable supervision\nsignals and compress contextual information into dense representations.\nHowever, the underlying reasons for the effectiveness of such a pre-training\ntechnique remain unclear. The usage of additional Transformer-based decoders\nalso incurs significant computational costs. In this study, we aim to shed\nlight on this issue by revealing that masked auto-encoder (MAE) pre-training\nwith enhanced decoding significantly improves the term coverage of input tokens\nin dense representations, compared to vanilla BERT checkpoints. Building upon\nthis observation, we propose a modification to the traditional MAE by replacing\nthe decoder of a masked auto-encoder with a completely simplified Bag-of-Word\nprediction task. 
This modification enables the efficient compression of lexical\nsignals into dense representations through unsupervised pre-training.\nRemarkably, our proposed method achieves state-of-the-art retrieval performance\non several large-scale retrieval benchmarks without requiring any additional\nparameters, which provides a 67% training speed-up compared to standard masked\nauto-encoder pre-training with enhanced decoding.\n","authors":["Guangyuan Ma","Xing Wu","Zijia Lin","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2401.11248v1.pdf","comment":"Working in progress. Our code will be available at\n https://github.com/ma787639046/bowdpr"},{"id":"http://arxiv.org/abs/2312.03122v3","updated":"2024-01-20T15:02:20Z","published":"2023-12-05T20:41:34Z","title":"Assertion Enhanced Few-Shot Learning: Instructive Technique for Large\n Language Models to Generate Educational Explanations","summary":" Human educators possess an intrinsic ability to anticipate and seek\neducational explanations from students, which drives them to pose\nthought-provoking questions when students cannot articulate these explanations\nindependently. We aim to imbue Intelligent Tutoring Systems with this ability\nusing few-shot learning capability of Large Language Models. Our work proposes\na novel prompting technique, Assertion Enhanced Few-Shot Learning, to\nfacilitate the generation of accurate, detailed oriented educational\nexplanations. Our central hypothesis is that, in educational domain, few-shot\ndemonstrations are necessary but not a sufficient condition for quality\nexplanation generation. We conducted a study involving 12 in-service teachers,\ncomparing our approach to Traditional Few-Shot Learning. The results show that\nAssertion Enhanced Few-Shot Learning improves explanation accuracy by 15% and\nyields higher-quality explanations, as evaluated by teachers. We also conduct a\nqualitative ablation study to factor the impact of assertions to provide\neducator-friendly prompting guidelines for generating explanations in their\ndomain of interest.\n","authors":["Tasmia Shahriar","Kelly Ramos","Noboru Matsuda"],"pdf_url":"https://arxiv.org/pdf/2312.03122v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.09333v2","updated":"2024-01-20T15:01:01Z","published":"2024-01-17T16:57:18Z","title":"Machines Do See Color: A Guideline to Classify Different Forms of Racist\n Discourse in Large Corpora","summary":" Current methods to identify and classify racist language in text rely on\nsmall-n qualitative approaches or large-n approaches focusing exclusively on\novert forms of racist discourse. This article provides a step-by-step\ngeneralizable guideline to identify and classify different forms of racist\ndiscourse in large corpora. In our approach, we start by conceptualizing racism\nand its different manifestations. We then contextualize these racist\nmanifestations to the time and place of interest, which allows researchers to\nidentify their discursive form. Finally, we apply XLM-RoBERTa (XLM-R), a\ncross-lingual model for supervised text classification with a cutting-edge\ncontextual understanding of text. We show that XLM-R and XLM-R-Racismo, our\npretrained model, outperform other state-of-the-art approaches in classifying\nracism in large corpora. 
We illustrate our approach using a corpus of tweets\nrelating to the Ecuadorian ind\\'igena community between 2018 and 2021.\n","authors":["Diana Davila Gordillo","Joan Timoneda","Sebastian Vallejo Vera"],"pdf_url":"https://arxiv.org/pdf/2401.09333v2.pdf","comment":"37 pages, 5 figures, 4 tables"},{"id":"http://arxiv.org/abs/2401.11246v1","updated":"2024-01-20T14:59:43Z","published":"2024-01-20T14:59:43Z","title":"Prompt-RAG: Pioneering Vector Embedding-Free Retrieval-Augmented\n Generation in Niche Domains, Exemplified by Korean Medicine","summary":" We propose a natural language prompt-based retrieval augmented generation\n(Prompt-RAG), a novel approach to enhance the performance of generative large\nlanguage models (LLMs) in niche domains. Conventional RAG methods mostly\nrequire vector embeddings, yet the suitability of generic LLM-based embedding\nrepresentations for specialized domains remains uncertain. To explore and\nexemplify this point, we compared vector embeddings from Korean Medicine (KM)\nand Conventional Medicine (CM) documents, finding that KM document embeddings\ncorrelated more with token overlaps and less with human-assessed document\nrelatedness, in contrast to CM embeddings. Prompt-RAG, distinct from\nconventional RAG models, operates without the need for embedding vectors. Its\nperformance was assessed through a Question-Answering (QA) chatbot application,\nwhere responses were evaluated for relevance, readability, and informativeness.\nThe results showed that Prompt-RAG outperformed existing models, including\nChatGPT and conventional vector embedding-based RAGs, in terms of relevance and\ninformativeness. Despite challenges like content structuring and response\nlatency, the advancements in LLMs are expected to encourage the use of\nPrompt-RAG, making it a promising tool for other domains in need of RAG\nmethods.\n","authors":["Bongsu Kang","Jundong Kim","Tae-Rim Yun","Chang-Eop Kim"],"pdf_url":"https://arxiv.org/pdf/2401.11246v1.pdf","comment":"26 pages, 4 figures, 5 tables"},{"id":"http://arxiv.org/abs/2305.16326v2","updated":"2024-01-20T14:33:54Z","published":"2023-05-10T13:40:06Z","title":"Large language models in biomedical natural language processing:\n benchmarks, baselines, and recommendations","summary":" Biomedical literature is growing rapidly, making it challenging to curate and\nextract knowledge manually. Biomedical natural language processing (BioNLP)\ntechniques that can automatically extract information from biomedical\nliterature help alleviate this burden. Recently, large Language Models (LLMs),\nsuch as GPT-3 and GPT-4, have gained significant attention for their impressive\nperformance. However, their effectiveness in BioNLP tasks and impact on method\ndevelopment and downstream users remain understudied. This pilot study (1)\nestablishes the baseline performance of GPT-3 and GPT-4 at both zero-shot and\none-shot settings in eight BioNLP datasets across four applications: named\nentity recognition, relation extraction, multi-label document classification,\nand semantic similarity and reasoning, (2) examines the errors produced by the\nLLMs and categorized the errors into three types: missingness, inconsistencies,\nand unwanted artificial content, and (3) provides suggestions for using LLMs in\nBioNLP applications. 
We make the datasets, baselines, and results publicly\navailable to the community via\nhttps://github.com/qingyu-qc/gpt_bionlp_benchmark.\n","authors":["Qingyu Chen","Jingcheng Du","Yan Hu","Vipina Kuttichi Keloth","Xueqing Peng","Kalpana Raja","Rui Zhang","Zhiyong Lu","Hua Xu"],"pdf_url":"https://arxiv.org/pdf/2305.16326v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.17080v2","updated":"2024-01-20T14:08:16Z","published":"2023-12-28T15:49:43Z","title":"MR-GSM8K: A Meta-Reasoning Revolution in Large Language Model Evaluation","summary":" In this work, we introduce a novel evaluation paradigm for Large Language\nModels, one that challenges them to engage in meta-reasoning. This approach\naddresses critical shortcomings in existing math problem-solving benchmarks,\ntraditionally used to evaluate the cognitive capabilities of agents. Our\nparadigm shifts the focus from result-oriented assessments, which often\noverlook the reasoning process, to a more holistic evaluation that effectively\ndifferentiates the cognitive capabilities among models. For example, in our\nbenchmark, GPT-4 demonstrates a performance five times better than GPT3-5. The\nsignificance of this new paradigm lies in its ability to reveal potential\ncognitive deficiencies in LLMs that current benchmarks, such as GSM8K, fail to\nuncover due to their saturation and lack of effective differentiation among\nvarying reasoning abilities. Our comprehensive analysis includes several\nstate-of-the-art math models from both open-source and closed-source\ncommunities, uncovering fundamental deficiencies in their training and\nevaluation approaches. This paper not only advocates for a paradigm shift in\nthe assessment of LLMs but also contributes to the ongoing discourse on the\ntrajectory towards Artificial General Intelligence (AGI). By promoting the\nadoption of meta-reasoning evaluation methods similar to ours, we aim to\nfacilitate a more accurate assessment of the true cognitive abilities of LLMs.\n","authors":["Zhongshen Zeng","Pengguang Chen","Shu Liu","Haiyun Jiang","Jiaya Jia"],"pdf_url":"https://arxiv.org/pdf/2312.17080v2.pdf","comment":"Code: https://github.com/dvlab-research/MR-GSM8K"},{"id":"http://arxiv.org/abs/2401.05949v3","updated":"2024-01-20T13:46:33Z","published":"2024-01-11T14:38:19Z","title":"Universal Vulnerabilities in Large Language Models: In-context Learning\n Backdoor Attacks","summary":" In-context learning, a paradigm bridging the gap between pre-training and\nfine-tuning, has demonstrated high efficacy in several NLP tasks, especially in\nfew-shot settings. Unlike traditional fine-tuning methods, in-context learning\nadapts pre-trained models to unseen tasks without updating any parameters.\nDespite being widely applied, in-context learning is vulnerable to malicious\nattacks. In this work, we raise security concerns regarding this paradigm. Our\nstudies demonstrate that an attacker can manipulate the behavior of large\nlanguage models by poisoning the demonstration context, without the need for\nfine-tuning the model. Specifically, we have designed a new backdoor attack\nmethod, named ICLAttack, to target large language models based on in-context\nlearning. Our method encompasses two types of attacks: poisoning demonstration\nexamples and poisoning prompts, which can make models behave in accordance with\npredefined intentions. ICLAttack does not require additional fine-tuning to\nimplant a backdoor, thus preserving the model's generality. 
Furthermore, the\npoisoned examples are correctly labeled, enhancing the natural stealth of our\nattack method. Extensive experimental results across several language models,\nranging in size from 1.3B to 40B parameters, demonstrate the effectiveness of\nour attack method, exemplified by a high average attack success rate of 95.0%\nacross the three datasets on OPT models. Our findings highlight the\nvulnerabilities of language models, and we hope this work will raise awareness\nof the possible security threats associated with in-context learning.\n","authors":["Shuai Zhao","Meihuizi Jia","Luu Anh Tuan","Jinming Wen"],"pdf_url":"https://arxiv.org/pdf/2401.05949v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.04620v3","updated":"2024-01-20T13:04:29Z","published":"2024-01-09T15:44:44Z","title":"Agent Alignment in Evolving Social Norms","summary":" Agents based on Large Language Models (LLMs) are increasingly permeating\nvarious domains of human production and life, highlighting the importance of\naligning them with human values. The current alignment of AI systems primarily\nfocuses on passively aligning LLMs through human intervention. However, agents\npossess characteristics like receiving environmental feedback and\nself-evolution, rendering the LLM alignment methods inadequate. In response, we\npropose an evolutionary framework for agent evolution and alignment, named\nEvolutionaryAgent, which transforms agent alignment into a process of evolution\nand selection under the principle of survival of the fittest. In an environment\nwhere social norms continuously evolve, agents better adapted to the current\nsocial norms will have a higher probability of survival and proliferation,\nwhile those inadequately aligned dwindle over time. Experimental results\nassessing the agents from multiple perspectives in aligning with social norms\ndemonstrate that EvolutionaryAgent can align progressively better with the\nevolving social norms while maintaining its proficiency in general tasks.\nEffectiveness tests conducted on various open and closed-source LLMs as the\nfoundation for agents also prove the applicability of our approach.\n","authors":["Shimin Li","Tianxiang Sun","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2401.04620v3.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2401.09003v2","updated":"2024-01-20T12:43:37Z","published":"2024-01-17T06:48:16Z","title":"Augmenting Math Word Problems via Iterative Question Composing","summary":" Despite recent progress in improving the mathematical reasoning ability of\nlarge language models(LLMs), solving competition-level math problems without\nthe use of external tools remains challenging for open-source LLMs. In this\nwork, we introduce the MMIQC dataset, a mixture of processed web data and\nsynthetic question-response pairs, to equip base models with better\nmathematical reasoning skills. In different model sizes, the models fine-tuned\non MMIQC consistently outperform their counterparts by a clear margin on MATH\ntest set. Notably, DeepSeek-67B-MMIQC achieves a 41.0% accuracy, 4.2% higher\nthan the previous open-source SOTA. Our experiments also show that a large part\nof the improvement can be attributed to our novel augmentation method\nIQC(Iterative Question Composing), where we iteratively ask an LLM to compose\nnew questions from the given seed problems and do rejection sampling from\nanother LLM. 
MMIQC has now been released on\nhttps://huggingface.co/datasets/Vivacem/MMIQC.\n","authors":["Haoxiong Liu","Andrew Chi-Chih Yao"],"pdf_url":"https://arxiv.org/pdf/2401.09003v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11218v1","updated":"2024-01-20T12:00:40Z","published":"2024-01-20T12:00:40Z","title":"End-to-End Argument Mining over Varying Rhetorical Structures","summary":" Rhetorical Structure Theory implies no single discourse interpretation of a\ntext, and the limitations of RST parsers further exacerbate inconsistent\nparsing of similar structures. Therefore, it is important to take into account\nthat the same argumentative structure can be found in semantically similar\ntexts with varying rhetorical structures. In this work, the differences between\nparaphrases within the same argument scheme are evaluated from a rhetorical\nperspective. The study proposes a deep dependency parsing model to assess the\nconnection between rhetorical and argument structures. The model utilizes\nrhetorical relations; RST structures of paraphrases serve as training data\naugmentations. The method allows for end-to-end argumentation analysis using a\nrhetorical tree instead of a word sequence. It is evaluated on the bilingual\nMicrotexts corpus, and the first results on fully-fledged argument parsing for\nthe Russian version of the corpus are reported. The results suggest that\nargument mining can benefit from multiple variants of discourse structure.\n","authors":["Elena Chistova"],"pdf_url":"https://arxiv.org/pdf/2401.11218v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11207v1","updated":"2024-01-20T10:42:15Z","published":"2024-01-20T10:42:15Z","title":"Unfair TOS: An Automated Approach using Customized BERT","summary":" Terms of Service (ToS) form an integral part of any agreement as it defines\nthe legal relationship between a service provider and an end-user. Not only do\nthey establish and delineate reciprocal rights and responsibilities, but they\nalso provide users with information on essential aspects of contracts that\npertain to the use of digital spaces. These aspects include a wide range of\ntopics, including limitation of liability, data protection, etc. Users tend to\naccept the ToS without going through it before using any application or\nservice. Such ignorance puts them in a potentially weaker situation in case any\naction is required. Existing methodologies for the detection or classification\nof unfair clauses are however obsolete and show modest performance. In this\nresearch paper, we present SOTA(State of The Art) results on unfair clause\ndetection from ToS documents based on unprecedented Fine-tuning BERT in\nintegration with SVC(Support Vector Classifier). The study shows proficient\nperformance with a macro F1-score of 0.922 at unfair clause detection, and\nsuperior performance is also shown in the classification of unfair clauses by\neach tag. Further, a comparative analysis is performed by answering research\nquestions on the Transformer models utilized. 
In order to further research and\nexperimentation the code and results are made available on\nhttps://github.com/batking24/Unfair-TOS-An-Automated-Approach-based-on-Fine-tuning-BERT-in-conjunction-with-ML.\n","authors":["Bathini Sai Akash","Akshara Kupireddy","Lalita Bhanu Murthy"],"pdf_url":"https://arxiv.org/pdf/2401.11207v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11206v1","updated":"2024-01-20T10:41:03Z","published":"2024-01-20T10:41:03Z","title":"InferAligner: Inference-Time Alignment for Harmlessness through\n Cross-Model Guidance","summary":" With the rapid development of large language models (LLMs), they are not only\nused as general-purpose AI assistants but are also customized through further\nfine-tuning to meet the requirements of different applications. A pivotal\nfactor in the success of current LLMs is the alignment process. Current\nalignment methods, such as supervised fine-tuning (SFT) and reinforcement\nlearning from human feedback (RLHF), focus on training-time alignment and are\noften complex and cumbersome to implement. Therefore, we develop\n\\textbf{InferAligner}, a novel inference-time alignment method that utilizes\ncross-model guidance for harmlessness alignment. InferAligner utilizes safety\nsteering vectors extracted from safety-aligned model to modify the activations\nof the target model when responding to harmful inputs, thereby guiding the\ntarget model to provide harmless responses. Experimental results show that our\nmethod can be very effectively applied to domain-specific models in finance,\nmedicine, and mathematics, as well as to multimodal large language models\n(MLLMs) such as LLaVA. It significantly diminishes the Attack Success Rate\n(ASR) of both harmful instructions and jailbreak attacks, while maintaining\nalmost unchanged performance in downstream tasks.\n","authors":["Pengyu Wang","Dong Zhang","Linyang Li","Chenkun Tan","Xinghao Wang","Ke Ren","Botian Jiang","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2401.11206v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11185v1","updated":"2024-01-20T09:49:59Z","published":"2024-01-20T09:49:59Z","title":"How the Advent of Ubiquitous Large Language Models both Stymie and\n Turbocharge Dynamic Adversarial Question Generation","summary":" Dynamic adversarial question generation, where humans write examples to stump\na model, aims to create examples that are realistic and informative. However,\nthe advent of large language models (LLMs) has been a double-edged sword for\nhuman authors: more people are interested in seeing and pushing the limits of\nthese models, but because the models are so much stronger an opponent, they are\nharder to defeat. To understand how these models impact adversarial question\nwriting process, we enrich the writing guidance with LLMs and retrieval models\nfor the authors to reason why their questions are not adversarial. While\nauthors could create interesting, challenging adversarial questions, they\nsometimes resort to tricks that result in poor questions that are ambiguous,\nsubjective, or confusing not just to a computer but also to humans. 
To address\nthese issues, we propose new metrics and incentives for eliciting good,\nchallenging questions and present a new dataset of adversarially authored\nquestions.\n","authors":["Yoo Yeon Sung","Ishani Mondal","Jordan Boyd-Graber"],"pdf_url":"https://arxiv.org/pdf/2401.11185v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04691v4","updated":"2024-01-20T09:36:41Z","published":"2023-10-07T05:37:41Z","title":"EMO: Earth Mover Distance Optimization for Auto-Regressive Language\n Modeling","summary":" Neural language models are probabilistic models of human text. They are\npredominantly trained using maximum likelihood estimation (MLE), which is\nequivalent to minimizing the forward cross-entropy between the empirical data\ndistribution and the model distribution. However, various degeneration\nphenomena are still widely observed when decoding from the distributions\nlearned by such models. We establish that the forward cross-entropy is\nsuboptimal as a distance metric for aligning human and model distribution due\nto its (1) recall-prioritization (2) negative diversity ignorance and (3)\ntrain-test mismatch. In this paper, we propose Earth Mover Distance\nOptimization (EMO) for auto-regressive language modeling. EMO capitalizes on\nthe inherent properties of earth mover distance to address the aforementioned\nchallenges. Due to the high complexity of direct computation, we further\nintroduce a feasible upper bound for EMO to ease end-to-end training. Upon\nextensive evaluation of language models trained using EMO and MLE. We find that\nEMO demonstrates a consistently better language modeling performance than MLE\nacross domains. Moreover, EMO demonstrates noteworthy enhancements in\ndownstream performance with minimal fine-tuning on merely 25,000 sentences.\nThis highlights the tremendous potential of EMO as a lightweight calibration\nmethod for enhancing large-scale pre-trained language models.\n","authors":["Siyu Ren","Zhiyong Wu","Kenny Q. Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.04691v4.pdf","comment":"To appear at ICLR 2024"},{"id":"http://arxiv.org/abs/2401.11143v1","updated":"2024-01-20T06:42:32Z","published":"2024-01-20T06:42:32Z","title":"Gaussian Adaptive Attention is All You Need: Robust Contextual\n Representations Across Multiple Modalities","summary":" We propose the Multi-Head Gaussian Adaptive Attention Mechanism (GAAM), a\nnovel probabilistic attention framework, and the Gaussian Adaptive Transformer\n(GAT), designed to enhance information aggregation across multiple modalities,\nincluding Speech, Text and Vision. GAAM integrates learnable mean and variance\ninto its attention mechanism, implemented in a Multi-Headed framework enabling\nit to collectively model any Probability Distribution for dynamic recalibration\nof feature significance. This method demonstrates significant improvements,\nespecially with highly non-stationary data, surpassing the state-of-the-art\nattention techniques in model performance (up to approximately +20% in\naccuracy) by identifying key elements within the feature space. GAAM's\ncompatibility with dot-product-based attention models and relatively low number\nof parameters showcases its adaptability and potential to boost existing\nattention frameworks. Empirically, GAAM exhibits superior adaptability and\nefficacy across a diverse range of tasks, including emotion recognition in\nspeech, image classification, and text classification, thereby establishing its\nrobustness and versatility in handling multi-modal data. 
Furthermore, we\nintroduce the Importance Factor (IF), a new learning-based metric that enhances\nthe explainability of models trained with GAAM-based methods. Overall, GAAM\nrepresents an advancement towards development of better performing and more\nexplainable attention models across multiple modalities.\n","authors":["Georgios Ioannides","Aman Chadha","Aaron Elkins"],"pdf_url":"https://arxiv.org/pdf/2401.11143v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.15407v2","updated":"2024-01-20T06:26:33Z","published":"2023-12-24T04:50:57Z","title":"A Comprehensive Analysis of the Effectiveness of Large Language Models\n as Automatic Dialogue Evaluators","summary":" Automatic evaluation is an integral aspect of dialogue system research. The\ntraditional reference-based NLG metrics are generally found to be unsuitable\nfor dialogue assessment. Consequently, recent studies have suggested various\nunique, reference-free neural metrics that better align with human evaluations.\nNotably among them, large language models (LLMs), particularly the\ninstruction-tuned variants like ChatGPT, are shown to be promising substitutes\nfor human judges. Yet, existing works on utilizing LLMs for automatic dialogue\nevaluation are limited in their scope in terms of the number of meta-evaluation\ndatasets, mode of evaluation, coverage of LLMs, etc. Hence, it remains\ninconclusive how effective these LLMs are. To this end, we conduct a\ncomprehensive study on the application of LLMs for automatic dialogue\nevaluation. Specifically, we analyze the multi-dimensional evaluation\ncapability of 30 recently emerged LLMs at both turn and dialogue levels, using\na comprehensive set of 12 meta-evaluation datasets. Additionally, we probe the\nrobustness of the LLMs in handling various adversarial perturbations at both\nturn and dialogue levels. Finally, we explore how model-level and\ndimension-level ensembles impact the evaluation performance. All resources are\navailable at https://github.com/e0397123/comp-analysis.\n","authors":["Chen Zhang","Luis Fernando D'Haro","Yiming Chen","Malu Zhang","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2312.15407v2.pdf","comment":"An extended version of AAAI-2024 camera-ready paper (appendix\n included, 16 pages)"},{"id":"http://arxiv.org/abs/2401.11120v1","updated":"2024-01-20T05:10:46Z","published":"2024-01-20T05:10:46Z","title":"Enhancing Large Language Models for Clinical Decision Support by\n Incorporating Clinical Practice Guidelines","summary":" Background Large Language Models (LLMs), enhanced with Clinical Practice\nGuidelines (CPGs), can significantly improve Clinical Decision Support (CDS).\nHowever, methods for incorporating CPGs into LLMs are not well studied. Methods\nWe develop three distinct methods for incorporating CPGs into LLMs: Binary\nDecision Tree (BDT), Program-Aided Graph Construction (PAGC), and\nChain-of-Thought-Few-Shot Prompting (CoT-FSP). To evaluate the effectiveness of\nthe proposed methods, we create a set of synthetic patient descriptions and\nconduct both automatic and human evaluation of the responses generated by four\nLLMs: GPT-4, GPT-3.5 Turbo, LLaMA, and PaLM 2. Zero-Shot Prompting (ZSP) was\nused as the baseline method. We focus on CDS for COVID-19 outpatient treatment\nas the case study. Results All four LLMs exhibit improved performance when\nenhanced with CPGs compared to the baseline ZSP. BDT outperformed both CoT-FSP\nand PAGC in automatic evaluation. All of the proposed methods demonstrated high\nperformance in human evaluation. 
Conclusion LLMs enhanced with CPGs demonstrate\nsuperior performance, as compared to plain LLMs with ZSP, in providing accurate\nrecommendations for COVID-19 outpatient treatment, which also highlights the\npotential for broader applications beyond the case study.\n","authors":["David Oniani","Xizhi Wu","Shyam Visweswaran","Sumit Kapoor","Shravan Kooragayalu","Katelyn Polanska","Yanshan Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11120v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10873v2","updated":"2024-01-20T03:58:10Z","published":"2023-10-16T22:53:54Z","title":"IDEAL: Influence-Driven Selective Annotations Empower In-Context\n Learners in Large Language Models","summary":" In-context learning is a promising paradigm that utilizes in-context examples\nas prompts for the predictions of large language models. These prompts are\ncrucial for achieving strong performance. However, since the prompts need to be\nsampled from a large volume of annotated examples, finding the right prompt may\nresult in high annotation costs. To address this challenge, this paper\nintroduces an influence-driven selective annotation method that aims to\nminimize annotation costs while improving the quality of in-context examples.\nThe essence of our method is to select a pivotal subset from a large-scale\nunlabeled data pool to annotate for the subsequent sampling of prompts.\nSpecifically, a directed graph is first constructed to represent unlabeled\ndata. Afterward, the influence of candidate unlabeled subsets is quantified\nwith a diffusion process. A simple yet effective greedy algorithm for unlabeled\ndata selection is lastly introduced. It iteratively selects the data if it\nprovides a maximum marginal gain with respect to quantified influence. Compared\nwith previous efforts on selective annotations, our influence-driven method\nworks in an end-to-end manner, avoids an intractable explicit balance between\ndata diversity and representativeness, and enjoys theoretical support.\nExperiments confirm the superiority of the proposed method on various\nbenchmarks, achieving better performance under lower time consumption during\nsubset selection. The project page is available at\nhttps://skzhang1.github.io/IDEAL/.\n","authors":["Shaokun Zhang","Xiaobo Xia","Zhaoqing Wang","Ling-Hao Chen","Jiale Liu","Qingyun Wu","Tongliang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.10873v2.pdf","comment":"Accepted by ICLR 2024"},{"id":"http://arxiv.org/abs/2401.11107v1","updated":"2024-01-20T03:55:17Z","published":"2024-01-20T03:55:17Z","title":"Exploiting Duality in Open Information Extraction with Predicate Prompt","summary":" Open information extraction (OpenIE) aims to extract the schema-free triplets\nin the form of (\\emph{subject}, \\emph{predicate}, \\emph{object}) from a given\nsentence. Compared with general information extraction (IE), OpenIE poses more\nchallenges for the IE models, {especially when multiple complicated triplets\nexist in a sentence. 
To extract these complicated triplets more effectively, in\nthis paper we propose a novel generative OpenIE model, namely \\emph{DualOIE},\nwhich achieves a dual task at the same time as extracting some triplets from\nthe sentence, i.e., converting the triplets into the sentence.} Such dual task\nencourages the model to correctly recognize the structure of the given sentence\nand thus is helpful to extract all potential triplets from the sentence.\nSpecifically, DualOIE extracts the triplets in two steps: 1) first extracting a\nsequence of all potential predicates, 2) then using the predicate sequence as a\nprompt to induce the generation of triplets. Our experiments on two benchmarks\nand our dataset constructed from Meituan demonstrate that DualOIE achieves the\nbest performance among the state-of-the-art baselines. Furthermore, the online\nA/B test on Meituan platform shows that 0.93\\% improvement of QV-CTR and 0.56\\%\nimprovement of UV-CTR have been obtained when the triplets extracted by DualOIE\nwere leveraged in Meituan's search system.\n","authors":["Zhen Chen","Jingping Liu","Deqing Yang","Yanghua Xiao","Huimin Xu","Zongyu Wang","Rui Xie","Yunsen Xian"],"pdf_url":"https://arxiv.org/pdf/2401.11107v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.08577v3","updated":"2024-01-20T02:36:12Z","published":"2023-02-16T20:46:36Z","title":"For Generated Text, Is NLI-Neutral Text the Best Text?","summary":" We explore incorporating natural language inference (NLI) into the text\ngenerative pipeline by using a pre-trained NLI model to assess whether a\ngenerated sentence entails, contradicts, or is neutral to the prompt and\npreceding text. First, we show that the NLI task is predictive of generation\nerrors made by GPT-3. We use these results to develop an NLI-informed\ngeneration procedure for GPT-J. Then, we evaluate these generations by\nobtaining human annotations on error types and overall quality. We find that an\nNLI strategy of maximizing entailment improves text generation when the nucleus\nsampling randomness parameter value is high, while one which maximizes\ncontradiction is in fact productive when the parameter value is low. Overall,\nthough, we demonstrate that an NLI strategy of maximizing the neutral class\nprovides the highest quality of generated text (significantly better than the\nvanilla generations), regardless of parameter value.\n","authors":["Michail Mersinias","Kyle Mahowald"],"pdf_url":"https://arxiv.org/pdf/2302.08577v3.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2401.11248v1","updated":"2024-01-20T15:02:33Z","published":"2024-01-20T15:02:33Z","title":"Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense\n Passage Retrieval","summary":" Masked auto-encoder pre-training has emerged as a prevalent technique for\ninitializing and enhancing dense retrieval systems. It generally utilizes\nadditional Transformer decoder blocks to provide sustainable supervision\nsignals and compress contextual information into dense representations.\nHowever, the underlying reasons for the effectiveness of such a pre-training\ntechnique remain unclear. The usage of additional Transformer-based decoders\nalso incurs significant computational costs. In this study, we aim to shed\nlight on this issue by revealing that masked auto-encoder (MAE) pre-training\nwith enhanced decoding significantly improves the term coverage of input tokens\nin dense representations, compared to vanilla BERT checkpoints. 
Building upon\nthis observation, we propose a modification to the traditional MAE by replacing\nthe decoder of a masked auto-encoder with a completely simplified Bag-of-Word\nprediction task. This modification enables the efficient compression of lexical\nsignals into dense representations through unsupervised pre-training.\nRemarkably, our proposed method achieves state-of-the-art retrieval performance\non several large-scale retrieval benchmarks without requiring any additional\nparameters, which provides a 67% training speed-up compared to standard masked\nauto-encoder pre-training with enhanced decoding.\n","authors":["Guangyuan Ma","Xing Wu","Zijia Lin","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2401.11248v1.pdf","comment":"Working in progress. Our code will be available at\n https://github.com/ma787639046/bowdpr"},{"id":"http://arxiv.org/abs/2401.11246v1","updated":"2024-01-20T14:59:43Z","published":"2024-01-20T14:59:43Z","title":"Prompt-RAG: Pioneering Vector Embedding-Free Retrieval-Augmented\n Generation in Niche Domains, Exemplified by Korean Medicine","summary":" We propose a natural language prompt-based retrieval augmented generation\n(Prompt-RAG), a novel approach to enhance the performance of generative large\nlanguage models (LLMs) in niche domains. Conventional RAG methods mostly\nrequire vector embeddings, yet the suitability of generic LLM-based embedding\nrepresentations for specialized domains remains uncertain. To explore and\nexemplify this point, we compared vector embeddings from Korean Medicine (KM)\nand Conventional Medicine (CM) documents, finding that KM document embeddings\ncorrelated more with token overlaps and less with human-assessed document\nrelatedness, in contrast to CM embeddings. Prompt-RAG, distinct from\nconventional RAG models, operates without the need for embedding vectors. Its\nperformance was assessed through a Question-Answering (QA) chatbot application,\nwhere responses were evaluated for relevance, readability, and informativeness.\nThe results showed that Prompt-RAG outperformed existing models, including\nChatGPT and conventional vector embedding-based RAGs, in terms of relevance and\ninformativeness. Despite challenges like content structuring and response\nlatency, the advancements in LLMs are expected to encourage the use of\nPrompt-RAG, making it a promising tool for other domains in need of RAG\nmethods.\n","authors":["Bongsu Kang","Jundong Kim","Tae-Rim Yun","Chang-Eop Kim"],"pdf_url":"https://arxiv.org/pdf/2401.11246v1.pdf","comment":"26 pages, 4 figures, 5 tables"},{"id":"http://arxiv.org/abs/2305.16326v2","updated":"2024-01-20T14:33:54Z","published":"2023-05-10T13:40:06Z","title":"Large language models in biomedical natural language processing:\n benchmarks, baselines, and recommendations","summary":" Biomedical literature is growing rapidly, making it challenging to curate and\nextract knowledge manually. Biomedical natural language processing (BioNLP)\ntechniques that can automatically extract information from biomedical\nliterature help alleviate this burden. Recently, large Language Models (LLMs),\nsuch as GPT-3 and GPT-4, have gained significant attention for their impressive\nperformance. However, their effectiveness in BioNLP tasks and impact on method\ndevelopment and downstream users remain understudied. 
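The bag-of-words pre-training objective described above can be illustrated in a few lines of numpy: the dense passage vector is asked to predict which vocabulary ids occurred in the input through a single linear projection and a multi-label cross-entropy, in place of a Transformer decoder. This is a hedged sketch of the idea, not the authors' code (which they indicate will appear at the linked repository).

import numpy as np

def bow_loss(dense_vec, input_ids, W, vocab_size):
    """dense_vec: (d,) passage embedding; input_ids: token ids of the passage;
    W: (vocab_size, d) projection; returns a scalar multi-label BCE loss."""
    logits = W @ dense_vec                          # (vocab_size,)
    probs = 1.0 / (1.0 + np.exp(-logits))           # sigmoid per vocabulary entry
    targets = np.zeros(vocab_size)
    targets[np.unique(input_ids)] = 1.0             # bag of words: 1 if the token occurs
    eps = 1e-9
    return -np.mean(targets * np.log(probs + eps) +
                    (1 - targets) * np.log(1 - probs + eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, vocab = 16, 1000
    h = rng.normal(size=d)                # stand-in for the encoder's dense output
    W = rng.normal(size=(vocab, d)) * 0.01
    ids = np.array([5, 17, 17, 256, 999])
    print("BoW loss:", bow_loss(h, ids, W, vocab))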
This pilot study (1)\nestablishes the baseline performance of GPT-3 and GPT-4 at both zero-shot and\none-shot settings in eight BioNLP datasets across four applications: named\nentity recognition, relation extraction, multi-label document classification,\nand semantic similarity and reasoning, (2) examines the errors produced by the\nLLMs and categorized the errors into three types: missingness, inconsistencies,\nand unwanted artificial content, and (3) provides suggestions for using LLMs in\nBioNLP applications. We make the datasets, baselines, and results publicly\navailable to the community via\nhttps://github.com/qingyu-qc/gpt_bionlp_benchmark.\n","authors":["Qingyu Chen","Jingcheng Du","Yan Hu","Vipina Kuttichi Keloth","Xueqing Peng","Kalpana Raja","Rui Zhang","Zhiyong Lu","Hua Xu"],"pdf_url":"https://arxiv.org/pdf/2305.16326v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11201v1","updated":"2024-01-20T10:28:25Z","published":"2024-01-20T10:28:25Z","title":"Navigating the Thin Line: Examining User Behavior in Search to Detect\n Engagement and Backfire Effects","summary":" Opinionated users often seek information that aligns with their preexisting\nbeliefs while dismissing contradictory evidence due to confirmation bias. This\nconduct hinders their ability to consider alternative stances when searching\nthe web. Despite this, few studies have analyzed how the diversification of\nsearch results on disputed topics influences the search behavior of highly\nopinionated users. To this end, we present a preregistered user study (n = 257)\ninvestigating whether different levels (low and high) of bias metrics and\nsearch results presentation (with or without AI-predicted stances labels) can\naffect the stance diversity consumption and search behavior of opinionated\nusers on three debated topics (i.e., atheism, intellectual property rights, and\nschool uniforms). Our results show that exposing participants to\n(counter-attitudinally) biased search results increases their consumption of\nattitude-opposing content, but we also found that bias was associated with a\ntrend toward overall fewer interactions within the search page. We also found\nthat 19% of users interacted with queries and search pages but did not select\nany search results. When we removed these participants in a post-hoc analysis,\nwe found that stance labels increased the diversity of stances consumed by\nusers, particularly when the search results were biased. Our findings highlight\nthe need for future research to explore distinct search scenario settings to\ngain insight into opinionated users' behavior.\n","authors":["F. M. Cau","N. Tintarev"],"pdf_url":"https://arxiv.org/pdf/2401.11201v1.pdf","comment":"17 pages, 3 figures, ECIR2024 (46th European Conference on\n Information Retrieval - IR4Good track)"},{"id":"http://arxiv.org/abs/2401.11198v1","updated":"2024-01-20T10:25:58Z","published":"2024-01-20T10:25:58Z","title":"A Deep Learning Approach for Selective Relevance Feedback","summary":" Pseudo-relevance feedback (PRF) can enhance average retrieval effectiveness\nover a sufficiently large number of queries. However, PRF often introduces a\ndrift into the original information need, thus hurting the retrieval\neffectiveness of several queries. While a selective application of PRF can\npotentially alleviate this issue, previous approaches have largely relied on\nunsupervised or feature-based learning to determine whether a query should be\nexpanded. 
In contrast, we revisit the problem of selective PRF from a deep\nlearning perspective, presenting a model that is entirely data-driven and\ntrained in an end-to-end manner. The proposed model leverages a\ntransformer-based bi-encoder architecture. Additionally, to further improve\nretrieval effectiveness with this selective PRF approach, we make use of the\nmodel's confidence estimates to combine the information from the original and\nexpanded queries. In our experiments, we apply this selective feedback on a\nnumber of different combinations of ranking and feedback models, and show that\nour proposed approach consistently improves retrieval effectiveness for both\nsparse and dense ranking models, with the feedback models being either sparse,\ndense or generative.\n","authors":["Suchana Datta","Debasis Ganguly","Sean MacAvaney","Derek Greene"],"pdf_url":"https://arxiv.org/pdf/2401.11198v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07042v2","updated":"2024-01-20T08:58:56Z","published":"2023-04-14T10:33:56Z","title":"Learning Graph ODE for Continuous-Time Sequential Recommendation","summary":" Sequential recommendation aims at understanding user preference by capturing\nsuccessive behavior correlations, which are usually represented as the item\npurchasing sequences based on their past interactions. Existing efforts\ngenerally predict the next item via modeling the sequential patterns. Despite\neffectiveness, there exist two natural deficiencies: (i) user preference is\ndynamic in nature, and the evolution of collaborative signals is often ignored;\nand (ii) the observed interactions are often irregularly-sampled, while\nexisting methods model item transitions assuming uniform intervals. Thus, how\nto effectively model and predict the underlying dynamics for user preference\nbecomes a critical research problem. To tackle the above challenges, in this\npaper, we focus on continuous-time sequential recommendation and propose a\nprincipled graph ordinary differential equation framework named GDERec.\nTechnically, GDERec is characterized by an autoregressive graph ordinary\ndifferential equation consisting of two components, which are parameterized by\ntwo tailored graph neural networks (GNNs) respectively to capture user\npreference from the perspective of hybrid dynamical systems. The two customized\nGNNs are trained alternately in an autoregressive manner to track the evolution\nof the underlying system from irregular observations, and thus learn effective\nrepresentations of users and items beneficial to the sequential recommendation.\nExtensive experiments on five benchmark datasets demonstrate the superiority of\nour model over various state-of-the-art recommendation methods.\n","authors":["Yifang Qin","Wei Ju","Hongjun Wu","Xiao Luo","Ming Zhang"],"pdf_url":"https://arxiv.org/pdf/2304.07042v2.pdf","comment":"Accepted by EEE Transactions on Knowledge and Data Engineering (TKDE\n 2024)"},{"id":"http://arxiv.org/abs/2401.11145v1","updated":"2024-01-20T06:52:14Z","published":"2024-01-20T06:52:14Z","title":"Document Set Expansion with Positive-Unlabeled Learning: A Density\n Estimation-based Approach","summary":" Document set expansion aims to identify relevant documents from a large\ncollection based on a small set of documents that are on a fine-grained topic.\nPrevious work shows that PU learning is a promising method for this task.\nHowever, some serious issues remain unresolved, i.e. 
typical challenges that PU\nmethods suffer from, such as unknown class prior and imbalanced data, and the need\nfor transductive experimental settings. In this paper, we propose a novel PU\nlearning framework based on density estimation, called puDE, that can handle\nthe above issues. The advantage of puDE is that it is neither constrained to the\nSCAR assumption nor requires any class prior knowledge. We demonstrate the\neffectiveness of the proposed method using a series of real-world datasets and\nconclude that our method is a better alternative for the DSE task.\n","authors":["Haiyang Zhang","Qiuyi Chen","Yuanjie Zou","Yushan Pan","Jia Wang","Mark Stevenson"],"pdf_url":"https://arxiv.org/pdf/2401.11145v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.10049v2","updated":"2024-01-20T05:12:52Z","published":"2023-12-02T06:36:14Z","title":"Knowledge Graph Reasoning Based on Attention GCN","summary":" We propose a novel technique to enhance Knowledge Graph Reasoning by\ncombining Graph Convolution Neural Network (GCN) with the Attention Mechanism.\nThis approach utilizes the Attention Mechanism to examine the relationships\nbetween entities and their neighboring nodes, which helps to develop detailed\nfeature vectors for each entity. The GCN uses shared parameters to effectively\nrepresent the characteristics of adjacent entities. We first learn the\nsimilarity of entities for node representation learning. By integrating the\nattributes of the entities and their interactions, this method generates\nextensive implicit feature vectors for each entity, improving performance in\ntasks including entity classification and link prediction, outperforming\ntraditional neural network models. To conclude, this work provides crucial\nmethodological support for a range of applications, such as search engines,\nquestion-answering systems, recommendation systems, and data integration tasks.\n","authors":["Meera Gupta","Ravi Khanna","Divya Choudhary","Nandini Rao"],"pdf_url":"https://arxiv.org/pdf/2312.10049v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11107v1","updated":"2024-01-20T03:55:17Z","published":"2024-01-20T03:55:17Z","title":"Exploiting Duality in Open Information Extraction with Predicate Prompt","summary":" Open information extraction (OpenIE) aims to extract the schema-free triplets\nin the form of (\\emph{subject}, \\emph{predicate}, \\emph{object}) from a given\nsentence. Compared with general information extraction (IE), OpenIE poses more\nchallenges for the IE models, {especially when multiple complicated triplets\nexist in a sentence. To extract these complicated triplets more effectively, in\nthis paper we propose a novel generative OpenIE model, namely \\emph{DualOIE},\nwhich achieves a dual task at the same time as extracting some triplets from\nthe sentence, i.e., converting the triplets into the sentence.} Such dual task\nencourages the model to correctly recognize the structure of the given sentence\nand thus is helpful to extract all potential triplets from the sentence.\nSpecifically, DualOIE extracts the triplets in two steps: 1) first extracting a\nsequence of all potential predicates, 2) then using the predicate sequence as a\nprompt to induce the generation of triplets. Our experiments on two benchmarks\nand our dataset constructed from Meituan demonstrate that DualOIE achieves the\nbest performance among the state-of-the-art baselines. 
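A hedged sketch of the density-estimation view of PU learning used by puDE as described above: fit one density on the small positive (seed) set and one on the whole pool, then rank unlabeled documents by the log density ratio. The isotropic Gaussian KDE, bandwidth, and toy embeddings are illustrative assumptions rather than the paper's estimator.

import numpy as np

def gaussian_kde(train, query, bandwidth=0.5):
    # Simple isotropic-Gaussian kernel density estimate; returns log p(query).
    d = train.shape[1]
    diff = query[:, None, :] - train[None, :, :]           # (n_query, n_train, d)
    sq = (diff ** 2).sum(-1) / (2 * bandwidth ** 2)
    log_kernel = -sq - 0.5 * d * np.log(2 * np.pi * bandwidth ** 2)
    return np.logaddexp.reduce(log_kernel, axis=1) - np.log(train.shape[0])

def pu_scores(pos_emb, pool_emb):
    """Higher score = more likely to belong to the fine-grained positive topic."""
    log_p_pos = gaussian_kde(pos_emb, pool_emb)
    log_p_all = gaussian_kde(pool_emb, pool_emb)
    return log_p_pos - log_p_all                            # log density ratio

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    positives = rng.normal(loc=2.0, size=(20, 8))            # labeled seed documents
    pool = np.vstack([rng.normal(loc=2.0, size=(30, 8)),     # hidden positives
                      rng.normal(loc=-2.0, size=(100, 8))])  # off-topic documents
    scores = pu_scores(positives, pool)
    print("top-5 candidates:", np.argsort(-scores)[:5])      # indices < 30 expected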
Furthermore, the online\nA/B test on Meituan platform shows that 0.93\\% improvement of QV-CTR and 0.56\\%\nimprovement of UV-CTR have been obtained when the triplets extracted by DualOIE\nwere leveraged in Meituan's search system.\n","authors":["Zhen Chen","Jingping Liu","Deqing Yang","Yanghua Xiao","Huimin Xu","Zongyu Wang","Rui Xie","Yunsen Xian"],"pdf_url":"https://arxiv.org/pdf/2401.11107v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11089v1","updated":"2024-01-20T02:38:21Z","published":"2024-01-20T02:38:21Z","title":"FedRKG: A Privacy-preserving Federated Recommendation Framework via\n Knowledge Graph Enhancement","summary":" Federated Learning (FL) has emerged as a promising approach for preserving\ndata privacy in recommendation systems by training models locally. Recently,\nGraph Neural Networks (GNN) have gained popularity in recommendation tasks due\nto their ability to capture high-order interactions between users and items.\nHowever, privacy concerns prevent the global sharing of the entire user-item\ngraph. To address this limitation, some methods create pseudo-interacted items\nor users in the graph to compensate for missing information for each client.\nUnfortunately, these methods introduce random noise and raise privacy concerns.\nIn this paper, we propose FedRKG, a novel federated recommendation system,\nwhere a global knowledge graph (KG) is constructed and maintained on the server\nusing publicly available item information, enabling higher-order user-item\ninteractions. On the client side, a relation-aware GNN model leverages diverse\nKG relationships. To protect local interaction items and obscure gradients, we\nemploy pseudo-labeling and Local Differential Privacy (LDP). Extensive\nexperiments conducted on three real-world datasets demonstrate the competitive\nperformance of our approach compared to centralized algorithms while ensuring\nprivacy preservation. Moreover, FedRKG achieves an average accuracy improvement\nof 4% compared to existing federated learning baselines.\n","authors":["Dezhong Yao","Tongtong Liu","Qi Cao","Hai Jin"],"pdf_url":"https://arxiv.org/pdf/2401.11089v1.pdf","comment":null}]},"2024-01-23T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2401.12975v1","updated":"2024-01-23T18:59:43Z","published":"2024-01-23T18:59:43Z","title":"HAZARD Challenge: Embodied Decision Making in Dynamically Changing\n Environments","summary":" Recent advances in high-fidelity virtual environments serve as one of the\nmajor driving forces for building intelligent embodied agents to perceive,\nreason and interact with the physical world. Typically, these environments\nremain unchanged unless agents interact with them. However, in real-world\nscenarios, agents might also face dynamically changing environments\ncharacterized by unexpected events and need to rapidly take action accordingly.\nTo remedy this gap, we propose a new simulated embodied benchmark, called\nHAZARD, specifically designed to assess the decision-making abilities of\nembodied agents in dynamic situations. HAZARD consists of three unexpected\ndisaster scenarios, including fire, flood, and wind, and specifically supports\nthe utilization of large language models (LLMs) to assist common sense\nreasoning and decision-making. This benchmark enables us to evaluate autonomous\nagents' decision-making capabilities across various pipelines, including\nreinforcement learning (RL), rule-based, and search-based methods in\ndynamically changing environments. 
As a first step toward addressing this\nchallenge using large language models, we further develop an LLM-based agent\nand perform an in-depth analysis of its promise and challenge of solving these\nchallenging tasks. HAZARD is available at https://vis-www.cs.umass.edu/hazard/.\n","authors":["Qinhong Zhou","Sunli Chen","Yisong Wang","Haozhe Xu","Weihua Du","Hongxin Zhang","Yilun Du","Joshua B. Tenenbaum","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2401.12975v1.pdf","comment":"ICLR 2024. The first two authors contributed equally to this work"},{"id":"http://arxiv.org/abs/2401.12973v1","updated":"2024-01-23T18:59:21Z","published":"2024-01-23T18:59:21Z","title":"In-Context Language Learning: Arhitectures and Algorithms","summary":" Large-scale neural language models exhibit a remarkable capacity for\nin-context learning (ICL): they can infer novel functions from datasets\nprovided as input. Most of our current understanding of when and how ICL arises\ncomes from LMs trained on extremely simple learning problems like linear\nregression and associative recall. There remains a significant gap between\nthese model problems and the \"real\" ICL exhibited by LMs trained on large text\ncorpora, which involves not just retrieval and function approximation but\nfree-form generation of language and other structured outputs. In this paper,\nwe study ICL through the lens of a new family of model problems we term in\ncontext language learning (ICLL). In ICLL, LMs are presented with a set of\nstrings from a formal language, and must generate additional strings from the\nsame language. We focus on in-context learning of regular languages generated\nby random finite automata. We evaluate a diverse set of neural sequence models\n(including several RNNs, Transformers, and state-space model variants) on\nregular ICLL tasks, aiming to answer three questions: (1) Which model classes\nare empirically capable of ICLL? (2) What algorithmic solutions do successful\nmodels implement to perform ICLL? (3) What architectural changes can improve\nICLL in less performant models? We first show that Transformers significantly\noutperform neural sequence models with recurrent or convolutional\nrepresentations on ICLL tasks. Next, we provide evidence that their ability to\ndo so relies on specialized \"n-gram heads\" (higher-order variants of induction\nheads) that compute input-conditional next-token distributions. Finally, we\nshow that hard-wiring these heads into recurrent and convolutional models\nimproves performance not just on ICLL, but natural language modeling --\nimproving the perplexity of 340M-parameter models by up to 1.14 points (6.7%)\non the SlimPajama dataset.\n","authors":["Ekin Akyürek","Bailin Wang","Yoon Kim","Jacob Andreas"],"pdf_url":"https://arxiv.org/pdf/2401.12973v1.pdf","comment":"29 pages, 8 figures"},{"id":"http://arxiv.org/abs/2401.12970v1","updated":"2024-01-23T18:57:53Z","published":"2024-01-23T18:57:53Z","title":"Raidar: geneRative AI Detection viA Rewriting","summary":" We find that large language models (LLMs) are more likely to modify\nhuman-written text than AI-generated text when tasked with rewriting. This\ntendency arises because LLMs often perceive AI-generated text as high-quality,\nleading to fewer modifications. We introduce a method to detect AI-generated\ncontent by prompting LLMs to rewrite text and calculating the editing distance\nof the output. We dubbed our geneRative AI Detection viA Rewriting method\nRaidar. 
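The in-context language learning setup described above can be made concrete with a small sketch that samples a random DFA and draws strings it accepts, the kind of example set an ICLL prompt would present to the model before asking it to continue the language. The automaton size, alphabet, and sampling procedure are illustrative choices, not the benchmark's exact generator.

import random

def random_dfa(n_states=4, alphabet="ab", seed=0):
    rng = random.Random(seed)
    delta = {(s, ch): rng.randrange(n_states)
             for s in range(n_states) for ch in alphabet}
    accept = {s for s in range(n_states) if rng.random() < 0.5} or {0}
    return delta, accept

def accepts(delta, accept, string, start=0):
    state = start
    for ch in string:
        state = delta[(state, ch)]
    return state in accept

def sample_accepted(delta, accept, alphabet="ab", max_len=8, n=5, seed=1):
    rng = random.Random(seed)
    out = []
    for _ in range(10000):              # cap attempts so the loop always terminates
        if len(out) == n:
            break
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(1, max_len)))
        if accepts(delta, accept, s):
            out.append(s)
    return out

if __name__ == "__main__":
    delta, accept = random_dfa()
    examples = sample_accepted(delta, accept)
    # An ICLL prompt would present `examples` and ask the model to generate
    # further strings from the same regular language.
    print("accepting states:", accept, "examples:", examples)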
Raidar significantly improves the F1 detection scores of existing AI\ncontent detection models -- both academic and commercial -- across various\ndomains, including News, creative writing, student essays, code, Yelp reviews,\nand arXiv papers, with gains of up to 29 points. Operating solely on word\nsymbols without high-dimensional features, our method is compatible with black\nbox LLMs, and is inherently robust on new content. Our results illustrate the\nunique imprint of machine-generated text through the lens of the machines\nthemselves.\n","authors":["Chengzhi Mao","Carl Vondrick","Hao Wang","Junfeng Yang"],"pdf_url":"https://arxiv.org/pdf/2401.12970v1.pdf","comment":"Accepted by ICLR 2024"},{"id":"http://arxiv.org/abs/2401.12963v1","updated":"2024-01-23T18:45:54Z","published":"2024-01-23T18:45:54Z","title":"AutoRT: Embodied Foundation Models for Large Scale Orchestration of\n Robotic Agents","summary":" Foundation models that incorporate language, vision, and more recently\nactions have revolutionized the ability to harness internet scale data to\nreason about useful tasks. However, one of the key challenges of training\nembodied foundation models is the lack of data grounded in the physical world.\nIn this paper, we propose AutoRT, a system that leverages existing foundation\nmodels to scale up the deployment of operational robots in completely unseen\nscenarios with minimal human supervision. AutoRT leverages vision-language\nmodels (VLMs) for scene understanding and grounding, and further uses large\nlanguage models (LLMs) for proposing diverse and novel instructions to be\nperformed by a fleet of robots. Guiding data collection by tapping into the\nknowledge of foundation models enables AutoRT to effectively reason about\nautonomy tradeoffs and safety while significantly scaling up data collection\nfor robot learning. We demonstrate AutoRT proposing instructions to over 20\nrobots across multiple buildings and collecting 77k real robot episodes via\nboth teleoperation and autonomous robot policies. We experimentally show that\nsuch \"in-the-wild\" data collected by AutoRT is significantly more diverse, and\nthat AutoRT's use of LLMs allows for instruction following data collection\nrobots that can align to human preferences.\n","authors":["Michael Ahn","Debidatta Dwibedi","Chelsea Finn","Montse Gonzalez Arenas","Keerthana Gopalakrishnan","Karol Hausman","Brian Ichter","Alex Irpan","Nikhil Joshi","Ryan Julian","Sean Kirmani","Isabel Leal","Edward Lee","Sergey Levine","Yao Lu","Isabel Leal","Sharath Maddineni","Kanishka Rao","Dorsa Sadigh","Pannag Sanketi","Pierre Sermanet","Quan Vuong","Stefan Welker","Fei Xia","Ted Xiao","Peng Xu","Steve Xu","Zhuo Xu"],"pdf_url":"https://arxiv.org/pdf/2401.12963v1.pdf","comment":"26 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.08535v2","updated":"2024-01-23T18:35:40Z","published":"2023-10-12T17:24:15Z","title":"Formally Specifying the High-Level Behavior of LLM-Based Agents","summary":" Autonomous, goal-driven agents powered by LLMs have recently emerged as\npromising tools for solving challenging problems without the need for\ntask-specific finetuned models that can be expensive to procure. Currently, the\ndesign and implementation of such agents is ad hoc, as the wide variety of\ntasks that LLM-based agents may be applied to naturally means there can be no\none-size-fits-all approach to agent design. 
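A minimal sketch of the Raidar-style detection signal summarized above: prompt an LLM to rewrite the text, measure how much it changed, and flag low-change inputs as likely machine-generated. The rewrite() callable is a placeholder for a real model call, and the threshold is an illustrative value, not one reported in the paper.

import difflib
from typing import Callable

PREFIX = "Rewrite this text: "

def rewrite_change_ratio(text: str, rewrite: Callable[[str], str]) -> float:
    rewritten = rewrite(PREFIX + text)
    similarity = difflib.SequenceMatcher(None, text, rewritten).ratio()
    return 1.0 - similarity            # fraction of the text that was changed

def looks_ai_generated(text: str, rewrite: Callable[[str], str],
                       threshold: float = 0.15) -> bool:
    # Few edits (low change ratio) -> the LLM "accepts" the text -> flag as AI.
    return rewrite_change_ratio(text, rewrite) < threshold

if __name__ == "__main__":
    def toy_rewrite(prompt: str) -> str:
        # Stand-in model: lightly touches fluent text, heavily edits clunky text.
        body = prompt[len(PREFIX):]
        return body if "utilize" in body else "A thoroughly rephrased version of the input."

    print(looks_ai_generated("We utilize a transformer to utilize context.", toy_rewrite))
    print(looks_ai_generated("cat dog pizza random words here", toy_rewrite))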
In this work we aim to alleviate\nthe difficulty of designing and implementing new agents by proposing a\nminimalistic generation framework that simplifies the process of building\nagents. The framework we introduce allows the user to define desired agent\nbehaviors in a high-level, declarative specification that is then used to\nconstruct a decoding monitor which guarantees the LLM will produce an output\nexhibiting the desired behavior. Our declarative approach, in which the\nbehavior is described without concern for how it should be implemented or\nenforced, enables rapid design, implementation, and experimentation with\ndifferent LLM-based agents. We demonstrate how the proposed framework can be\nused to implement recent LLM-based agents (e.g., ReACT), and show how the\nflexibility of our approach can be leveraged to define a new agent with more\ncomplex behavior, the Plan-Act-Summarize-Solve (PASS) agent. Lastly, we\ndemonstrate that our method outperforms other agents on multiple popular\nreasoning-centric question-answering benchmarks.\n","authors":["Maxwell Crouse","Ibrahim Abdelaziz","Ramon Astudillo","Kinjal Basu","Soham Dan","Sadhana Kumaravel","Achille Fokoue","Pavan Kapanipathi","Salim Roukos","Luis Lastras"],"pdf_url":"https://arxiv.org/pdf/2310.08535v2.pdf","comment":"Preprint under review"},{"id":"http://arxiv.org/abs/2401.12954v1","updated":"2024-01-23T18:22:19Z","published":"2024-01-23T18:22:19Z","title":"Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding","summary":" We introduce meta-prompting, an effective scaffolding technique designed to\nenhance the functionality of language models (LMs). This approach transforms a\nsingle LM into a multi-faceted conductor, adept at managing and integrating\nmultiple independent LM queries. By employing high-level instructions,\nmeta-prompting guides the LM to break down complex tasks into smaller, more\nmanageable subtasks. These subtasks are then handled by distinct \"expert\"\ninstances of the same LM, each operating under specific, tailored instructions.\nCentral to this process is the LM itself, in its role as the conductor, which\nensures seamless communication and effective integration of the outputs from\nthese expert models. It additionally employs its inherent critical thinking and\nrobust verification processes to refine and authenticate the end result. This\ncollaborative prompting approach empowers a single LM to simultaneously act as\na comprehensive orchestrator and a panel of diverse experts, significantly\nenhancing its performance across a wide array of tasks. The zero-shot,\ntask-agnostic nature of meta-prompting greatly simplifies user interaction by\nobviating the need for detailed, task-specific instructions. Furthermore, our\nresearch demonstrates the seamless integration of external tools, such as a\nPython interpreter, into the meta-prompting framework, thereby broadening its\napplicability and utility. 
Through rigorous experimentation with GPT-4, we\nestablish the superiority of meta-prompting over conventional scaffolding\nmethods: When averaged across all tasks, including the Game of 24,\nCheckmate-in-One, and Python Programming Puzzles, meta-prompting, augmented\nwith a Python interpreter functionality, surpasses standard prompting by 17.1%,\nexpert (dynamic) prompting by 17.3%, and multipersona prompting by 15.2%.\n","authors":["Mirac Suzgun","Adam Tauman Kalai"],"pdf_url":"https://arxiv.org/pdf/2401.12954v1.pdf","comment":"https://github.com/suzgunmirac/meta-prompting"},{"id":"http://arxiv.org/abs/2310.17715v2","updated":"2024-01-23T18:19:18Z","published":"2023-10-26T18:22:13Z","title":"Outlier Dimensions Encode Task-Specific Knowledge","summary":" Representations from large language models (LLMs) are known to be dominated\nby a small subset of dimensions with exceedingly high variance. Previous works\nhave argued that although ablating these outlier dimensions in LLM\nrepresentations hurts downstream performance, outlier dimensions are\ndetrimental to the representational quality of embeddings. In this study, we\ninvestigate how fine-tuning impacts outlier dimensions and show that 1) outlier\ndimensions that occur in pre-training persist in fine-tuned models and 2) a\nsingle outlier dimension can complete downstream tasks with a minimal error\nrate. Our results suggest that outlier dimensions can encode crucial\ntask-specific knowledge and that the value of a representation in a single\noutlier dimension drives downstream model decisions.\n","authors":["William Rudman","Catherine Chen","Carsten Eickhoff"],"pdf_url":"https://arxiv.org/pdf/2310.17715v2.pdf","comment":"Camera-ready version for EMNLP 2023"},{"id":"http://arxiv.org/abs/2401.12947v1","updated":"2024-01-23T18:07:38Z","published":"2024-01-23T18:07:38Z","title":"Transformer-Based Models Are Not Yet Perfect At Learning to Emulate\n Structural Recursion","summary":" This paper investigates the ability of transformer-based models to learn\nstructural recursion from examples. Recursion is a universal concept in both\nnatural and formal languages. Structural recursion is central to the\nprogramming language and formal mathematics tasks where symbolic tools\ncurrently excel beyond neural models, such as inferring semantic relations\nbetween datatypes and emulating program behavior. We introduce a general\nframework that nicely connects the abstract concepts of structural recursion in\nthe programming language domain to concrete sequence modeling problems and\nlearned models' behavior. The framework includes a representation that captures\nthe general \\textit{syntax} of structural recursion, coupled with two different\nframeworks for understanding their \\textit{semantics} -- one that is more\nnatural from a programming languages perspective and one that helps bridge that\nperspective with a mechanistic understanding of the underlying transformer\narchitecture.\n With our framework as a powerful conceptual tool, we identify different\nissues under various set-ups. The models trained to emulate recursive\ncomputations cannot fully capture the recursion yet instead fit short-cut\nalgorithms and thus cannot solve certain edge cases that are under-represented\nin the training distribution. In addition, it is difficult for state-of-the-art\nlarge language models (LLMs) to mine recursive rules from in-context\ndemonstrations. 
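The conductor-and-experts pattern described in the meta-prompting abstract above can be sketched as three rounds of calls to a single text-in/text-out model: decompose the task, solve each subtask under a tailored instruction, then integrate and verify. The llm() callable and all prompt wording are placeholders, not the paper's prompts.

from typing import Callable, List

def meta_prompt(task: str, llm: Callable[[str], str]) -> str:
    # Conductor: break the task into subtasks, one per line.
    plan = llm(f"Break this task into short, independent subtasks, one per line:\n{task}")
    subtasks: List[str] = [s.strip() for s in plan.splitlines() if s.strip()]

    # Experts: each subtask is handled by a fresh instance with its own instruction.
    expert_outputs = [
        llm(f"You are an expert assigned only to this subtask.\nSubtask: {sub}\nAnswer:")
        for sub in subtasks
    ]

    # Conductor again: integrate and double-check the expert answers.
    joined = "\n".join(f"- {sub}: {out}" for sub, out in zip(subtasks, expert_outputs))
    return llm(f"Task: {task}\nExpert results:\n{joined}\n"
               f"Verify the results and produce the final answer.")

if __name__ == "__main__":
    def toy_llm(prompt: str) -> str:      # trivial stand-in for a real model
        if prompt.startswith("Break this task"):
            return "compute 2+2\ncompute 3*3"
        if "Subtask: compute 2+2" in prompt:
            return "4"
        if "Subtask: compute 3*3" in prompt:
            return "9"
        return "2+2=4 and 3*3=9"

    print(meta_prompt("Evaluate 2+2 and 3*3.", toy_llm))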
Meanwhile, these LLMs fail in interesting ways when emulating\nreduction (step-wise computation) of the recursive function.\n","authors":["Dylan Zhang","Curt Tigges","Zory Zhang","Stella Biderman","Maxim Raginsky","Talia Ringer"],"pdf_url":"https://arxiv.org/pdf/2401.12947v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2305.14699"},{"id":"http://arxiv.org/abs/2401.12941v1","updated":"2024-01-23T17:58:38Z","published":"2024-01-23T17:58:38Z","title":"Multicultural Name Recognition For Previously Unseen Names","summary":" State of the art Named Entity Recognition (NER) models have achieved an\nimpressive ability to extract common phrases from text that belong to labels\nsuch as location, organization, time, and person. However, typical NER systems\nthat rely on having seen a specific entity in their training data in order to\nlabel an entity perform poorly on rare or unseen entities (Derczynski et al., 2017).\nThis paper attempts to improve recognition of person names, a diverse category\nthat can grow any time someone is born or changes their name. In order for\ndownstream tasks to not exhibit bias based on cultural background, a model\nshould perform well on names from a variety of backgrounds. In this paper I\nexperiment with the training data and input structure of an English Bi-LSTM\nname recognition model. I look at names from 103 countries to compare how well\nthe model performs on names from different cultures, specifically in the\ncontext of a downstream task where extracted names will be matched to\ninformation on file. I find that a model with combined character and word input\noutperforms word-only models and may improve on accuracy compared to classical\nNER models that are not geared toward identifying unseen entity values.\n","authors":["Alexandra Loessberg-Zahl"],"pdf_url":"https://arxiv.org/pdf/2401.12941v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2401.12915v1","updated":"2024-01-23T17:07:18Z","published":"2024-01-23T17:07:18Z","title":"Red Teaming Visual Language Models","summary":" VLMs (Vision-Language Models) extend the capabilities of LLMs (Large Language\nModels) to accept multimodal inputs. Since it has been verified that LLMs can\nbe induced to generate harmful or inaccurate content through specific test\ncases (termed as Red Teaming), how VLMs perform in similar scenarios,\nespecially with their combination of textual and visual inputs, remains a\nquestion. To explore this problem, we present a novel red teaming dataset\nRTVLM, which encompasses 10 subtasks (e.g., image misleading, multi-modal\njail-breaking, face fairness, etc) under 4 primary aspects (faithfulness,\nprivacy, safety, fairness). Our RTVLM is the first red-teaming dataset to\nbenchmark current VLMs in terms of these 4 different aspects. Detailed analysis\nshows that 10 prominent open-sourced VLMs struggle with the red teaming in\ndifferent degrees and have up to 31% performance gap with GPT-4V. Additionally,\nwe simply apply red teaming alignment to LLaVA-v1.5 with Supervised Fine-tuning\n(SFT) using RTVLM, and this bolsters the models' performance with 10% in RTVLM\ntest set, 13% in MM-Hal, and without noticeable decline in MM-Bench,\noverpassing other LLaVA-based models with regular alignment data. This reveals\nthat current open-sourced VLMs still lack red teaming alignment. 
Our code and\ndatasets will be open-source.\n","authors":["Mukai Li","Lei Li","Yuwei Yin","Masood Ahmed","Zhenguang Liu","Qi Liu"],"pdf_url":"https://arxiv.org/pdf/2401.12915v1.pdf","comment":"Working in progress"},{"id":"http://arxiv.org/abs/2306.02272v3","updated":"2024-01-23T16:28:49Z","published":"2023-06-04T06:33:13Z","title":"OWQ: Lessons learned from activation outliers for weight quantization in\n large language models","summary":" Large language models (LLMs) with hundreds of billions of parameters require\npowerful server-grade GPUs for inference, limiting their practical deployment.\nTo address this challenge, we introduce the outlier-aware weight quantization\n(OWQ) method, which aims to minimize LLM's footprint through low-precision\nrepresentation. OWQ prioritizes a small subset of structured weights sensitive\nto quantization, storing them in high-precision, while applying highly tuned\nquantization to the remaining dense weights. This sensitivity-aware\nmixed-precision scheme reduces the quantization error notably, and extensive\nexperiments demonstrate that 3.1-bit models using OWQ perform comparably to\n4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a\nparameter-efficient fine-tuning for task-specific adaptation, called weak\ncolumn tuning (WCT), enabling accurate task-specific LLM adaptation with\nminimal memory overhead in the optimized format. OWQ represents a notable\nadvancement in the flexibility, efficiency, and practicality of LLM\noptimization literature. The source code is available at\nhttps://github.com/xvyaward/owq\n","authors":["Changhun Lee","Jungyu Jin","Taesu Kim","Hyungjun Kim","Eunhyeok Park"],"pdf_url":"https://arxiv.org/pdf/2306.02272v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12874v1","updated":"2024-01-23T16:09:53Z","published":"2024-01-23T16:09:53Z","title":"From Understanding to Utilization: A Survey on Explainability for Large\n Language Models","summary":" This survey paper delves into the burgeoning field of explainability for\nLarge Language Models (LLMs), a critical yet challenging aspect of natural\nlanguage processing. With LLMs playing a pivotal role in various applications,\ntheir \"black-box\" nature raises concerns about transparency and ethical use.\nThis paper emphasizes the necessity for enhanced explainability in LLMs,\naddressing both the general public's trust and the technical community's need\nfor a deeper understanding of these models. We concentrate on pre-trained\nTransformer-based LLMs, such as LLaMA, which present unique interpretability\nchallenges due to their scale and complexity. Our review categorizes existing\nexplainability methods and discusses their application in improving model\ntransparency and reliability. We also discuss representative evaluation\nmethods, highlighting their strengths and limitations. 
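A hedged numpy sketch of the outlier-aware mixed-precision idea described in the OWQ abstract above: estimate a per-column sensitivity, keep the most sensitive weight columns in full precision, and round-to-nearest quantize the rest to a low bit-width. The sensitivity proxy and the group-less quantizer are simplifications, not the OWQ implementation available at the linked repository.

import numpy as np

def quantize_rtn(W, bits=3):
    # Symmetric per-column round-to-nearest quantization.
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=0, keepdims=True), 1e-8) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def owq_like(W, act_var, bits=3, n_outlier_cols=4):
    """W: (out, in) weight matrix; act_var: per-input-channel activation variance."""
    sensitivity = act_var * (W ** 2).sum(axis=0)           # crude column sensitivity
    keep = np.argsort(-sensitivity)[:n_outlier_cols]       # columns kept in full precision
    Wq = quantize_rtn(W, bits)
    Wq[:, keep] = W[:, keep]                               # restore the sensitive columns
    return Wq, keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(64, 128)).astype(np.float32)
    act_var = rng.uniform(0.1, 5.0, size=128)
    Wq, kept = owq_like(W, act_var, bits=3, n_outlier_cols=8)
    err = np.linalg.norm(W - Wq) / np.linalg.norm(W)
    print("kept columns:", kept[:5], "relative error:", round(float(err), 4))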
The goal of this survey\nis to bridge the gap between theoretical understanding and practical\napplication, offering insights for future research and development in the field\nof LLM explainability.\n","authors":["Haoyan Luo","Lucia Specia"],"pdf_url":"https://arxiv.org/pdf/2401.12874v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12873v1","updated":"2024-01-23T16:07:43Z","published":"2024-01-23T16:07:43Z","title":"Improving Machine Translation with Human Feedback: An Exploration of\n Quality Estimation as a Reward Model","summary":" Insufficient modeling of human preferences within the reward model is a major\nobstacle for leveraging human feedback to improve translation quality.\nFortunately, quality estimation (QE), which predicts the quality of a given\ntranslation without reference, has achieved impressive alignment with human\nevaluations in the last two years. In this work, we investigate the potential\nof employing the QE model as the reward model (the QE-based reward model) to\npredict human preferences for feedback training. We first identify the\noveroptimization problem during QE-based feedback training, manifested as an\nincrease in reward while translation quality declines. We examine the problem\nand argue that the vulnerability of the QE model might lead to high rewards for\nincorrect translations, resulting in overoptimization and error propagation. To\naddress the problem, we adopt a simple yet effective method that uses heuristic\nrules to detect the incorrect translations and assigns a penalty term to the\nQE-based rewards for the detected incorrect translations. Experimental results\nshow that the proposed QE-based feedback training achieves consistent and\nsignificant improvements across various settings, further verified through\nhuman preference studies. Our subsequent analysis demonstrates the high data\nefficiency of the proposed QE-based feedback training: the proposed approach\nusing a small amount of monolingual data can outperform systems using larger\nparallel corpora.\n","authors":["Zhiwei He","Xing Wang","Wenxiang Jiao","Zhuosheng Zhang","Rui Wang","Shuming Shi","Zhaopeng Tu"],"pdf_url":"https://arxiv.org/pdf/2401.12873v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12863v1","updated":"2024-01-23T15:56:11Z","published":"2024-01-23T15:56:11Z","title":"KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning","summary":" Large Language Models (LLMs) have demonstrated impressive performance in\nnatural language processing tasks by leveraging chain of thought (CoT) that\nenables step-by-step thinking. Extending LLMs with multimodal capabilities is\nthe recent interest, but incurs computational cost and requires substantial\nhardware resources. To address these challenges, we propose KAM-CoT a framework\nthat integrates CoT reasoning, Knowledge Graphs (KGs), and multiple modalities\nfor a comprehensive understanding of multimodal tasks. KAM-CoT adopts a\ntwo-stage training process with KG grounding to generate effective rationales\nand answers. By incorporating external knowledge from KGs during reasoning, the\nmodel gains a deeper contextual understanding reducing hallucinations and\nenhancing the quality of answers. This knowledge-augmented CoT reasoning\nempowers the model to handle questions requiring external context, providing\nmore informed answers. Experimental findings show KAM-CoT outperforms the\nstate-of-the-art methods. 
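The penalized QE-based reward described in the machine-translation abstract above can be sketched as follows, assuming some heuristic rules for flagging incorrect translations (empty output, source copying, degenerate repetition) and a fixed penalty; both the rules and the penalty value are illustrative assumptions rather than the paper's exact heuristics.

def looks_incorrect(source: str, translation: str) -> bool:
    words = translation.split()
    if not words:
        return True                                   # empty output
    if translation.strip() == source.strip():
        return True                                   # untranslated copy of the source
    if len(set(words)) < max(1, len(words) // 3):
        return True                                   # degenerate repetition
    return False

def qe_reward(source: str, translation: str, qe_score: float,
              penalty: float = 1.0) -> float:
    # Reward for feedback training: the QE score minus a penalty for flagged outputs.
    return qe_score - (penalty if looks_incorrect(source, translation) else 0.0)

if __name__ == "__main__":
    src = "Das Wetter ist heute schoen."
    good = "The weather is nice today."
    bad = "nice nice nice nice nice nice"
    print(qe_reward(src, good, qe_score=0.82))        # 0.82
    print(qe_reward(src, bad, qe_score=0.90))         # penalised despite a high QE score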
On the ScienceQA dataset, we achieve an average\naccuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by\n10%. Remarkably, KAM-CoT achieves these results with only 280M trainable\nparameters at a time, demonstrating its cost-efficiency and effectiveness.\n","authors":["Debjyoti Mondal","Suraj Modi","Subhadarshi Panda","Rituraj Singh","Godawari Sudhakar Rao"],"pdf_url":"https://arxiv.org/pdf/2401.12863v1.pdf","comment":"AAAI 2024"},{"id":"http://arxiv.org/abs/2304.14391v4","updated":"2024-01-23T15:52:28Z","published":"2023-04-27T17:55:13Z","title":"Energy-based Models are Zero-Shot Planners for Compositional Scene\n Rearrangement","summary":" Language is compositional; an instruction can express multiple relation\nconstraints to hold among objects in a scene that a robot is tasked to\nrearrange. Our focus in this work is an instructable scene-rearranging\nframework that generalizes to longer instructions and to spatial concept\ncompositions never seen at training time. We propose to represent\nlanguage-instructed spatial concepts with energy functions over relative object\narrangements. A language parser maps instructions to corresponding energy\nfunctions and an open-vocabulary visual-language model grounds their arguments\nto relevant objects in the scene. We generate goal scene configurations by\ngradient descent on the sum of energy functions, one per language predicate in\nthe instruction. Local vision-based policies then re-locate objects to the\ninferred goal locations. We test our model on established instruction-guided\nmanipulation benchmarks, as well as benchmarks of compositional instructions we\nintroduce. We show our model can execute highly compositional instructions\nzero-shot in simulation and in the real world. It outperforms\nlanguage-to-action reactive policies and Large Language Model planners by a\nlarge margin, especially for long instructions that involve compositions of\nmultiple spatial concepts. Simulation and real-world robot execution videos, as\nwell as our code and datasets are publicly available on our website:\nhttps://ebmplanner.github.io.\n","authors":["Nikolaos Gkanatsios","Ayush Jain","Zhou Xian","Yunchu Zhang","Christopher Atkeson","Katerina Fragkiadaki"],"pdf_url":"https://arxiv.org/pdf/2304.14391v4.pdf","comment":"First two authors contributed equally | RSS 2023"},{"id":"http://arxiv.org/abs/2310.00367v2","updated":"2024-01-23T15:20:33Z","published":"2023-09-30T13:15:49Z","title":"AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with\n TikZ","summary":" Generating bitmap graphics from text has gained considerable attention, yet\nfor scientific figures, vector graphics are often preferred. Given that vector\ngraphics are typically encoded using low-level graphics primitives, generating\nthem directly is difficult. To address this, we propose the use of TikZ, a\nwell-known abstract graphics language that can be compiled to vector graphics,\nas an intermediate representation of scientific figures. TikZ offers\nhuman-oriented, high-level commands, thereby facilitating conditional language\nmodeling with any large language model. To this end, we introduce DaTikZ, the\nfirst large-scale TikZ dataset consisting of 120k TikZ drawings aligned with\ncaptions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which\naugments LLaMA with multimodal CLIP embeddings. 
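To illustrate the planning-by-energy-minimization recipe in the scene-rearrangement abstract above, the sketch below encodes two toy spatial predicates as energies over 2-D object positions and finds a goal configuration by gradient descent on their sum; the predicates, finite-difference gradients, and step sizes are illustrative stand-ins for the paper's learned energy models.

import numpy as np

def energy_left_of(pos, a, b, margin=0.2):
    # Low when object a sits at least `margin` to the left of object b.
    return max(0.0, pos[a][0] - pos[b][0] + margin) ** 2

def energy_near(pos, a, b, dist=0.3):
    return (np.linalg.norm(pos[a] - pos[b]) - dist) ** 2

def total_energy(pos, constraints):
    return sum(fn(pos, *args) for fn, *args in constraints)

def descend(pos, constraints, steps=300, lr=0.1, eps=1e-4):
    pos = {k: v.astype(float).copy() for k, v in pos.items()}
    for _ in range(steps):
        for name in pos:
            grad = np.zeros(2)
            for i in range(2):                        # finite-difference gradient
                bump = pos[name].copy()
                bump[i] += eps
                bumped = dict(pos, **{name: bump})
                grad[i] = (total_energy(bumped, constraints) -
                           total_energy(pos, constraints)) / eps
            pos[name] -= lr * grad
    return pos

if __name__ == "__main__":
    start = {"mug": np.array([0.8, 0.0]), "plate": np.array([0.0, 0.0])}
    # "Put the mug to the left of the plate and near it."
    constraints = [(energy_left_of, "mug", "plate"), (energy_near, "mug", "plate")]
    goal = descend(start, constraints)
    print({k: np.round(v, 2) for k, v in goal.items()})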
In both human and automatic\nevaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms\nof similarity to human-created figures, with CLiMA additionally improving\ntext-image alignment. Our detailed analysis shows that all models generalize\nwell and are not susceptible to memorization. GPT-4 and Claude 2, however, tend\nto generate more simplistic figures compared to both humans and our models. We\nmake our framework, AutomaTikZ, along with model weights and datasets, publicly\navailable.\n","authors":["Jonas Belouadi","Anne Lauscher","Steffen Eger"],"pdf_url":"https://arxiv.org/pdf/2310.00367v2.pdf","comment":"Accepted at ICLR 2024 (poster); Project Page:\n https://github.com/potamides/AutomaTikZ"},{"id":"http://arxiv.org/abs/2401.10543v2","updated":"2024-01-23T14:46:23Z","published":"2024-01-19T08:02:37Z","title":"Multilingual acoustic word embeddings for zero-resource languages","summary":" This research addresses the challenge of developing speech applications for\nzero-resource languages that lack labelled data. It specifically uses acoustic\nword embedding (AWE) -- fixed-dimensional representations of variable-duration\nspeech segments -- employing multilingual transfer, where labelled data from\nseveral well-resourced languages are used for pertaining. The study introduces\na new neural network that outperforms existing AWE models on zero-resource\nlanguages. It explores the impact of the choice of well-resourced languages.\nAWEs are applied to a keyword-spotting system for hate speech detection in\nSwahili radio broadcasts, demonstrating robustness in real-world scenarios.\nAdditionally, novel semantic AWE models improve semantic query-by-example\nsearch.\n","authors":["Christiaan Jacobs"],"pdf_url":"https://arxiv.org/pdf/2401.10543v2.pdf","comment":"PhD thesis"},{"id":"http://arxiv.org/abs/2401.12798v1","updated":"2024-01-23T14:31:12Z","published":"2024-01-23T14:31:12Z","title":"Gradient Flow of Energy: A General and Efficient Approach for Entity\n Alignment Decoding","summary":" Entity alignment (EA), a pivotal process in integrating multi-source\nKnowledge Graphs (KGs), seeks to identify equivalent entity pairs across these\ngraphs. Most existing approaches regard EA as a graph representation learning\ntask, concentrating on enhancing graph encoders. However, the decoding process\nin EA - essential for effective operation and alignment accuracy - has received\nlimited attention and remains tailored to specific datasets and model\narchitectures, necessitating both entity and additional explicit relation\nembeddings. This specificity limits its applicability, particularly in\nGNN-based models. To address this gap, we introduce a novel, generalized, and\nefficient decoding approach for EA, relying solely on entity embeddings. Our\nmethod optimizes the decoding process by minimizing Dirichlet energy, leading\nto the gradient flow within the graph, to promote graph homophily. The\ndiscretization of the gradient flow produces a fast and scalable approach,\ntermed Triple Feature Propagation (TFP). TFP innovatively channels gradient\nflow through three views: entity-to-entity, entity-to-relation, and\nrelation-to-entity. This generalized gradient flow enables TFP to harness the\nmulti-view structural information of KGs. Rigorous experimentation on diverse\nreal-world datasets demonstrates that our approach significantly enhances\nvarious EA methods. 
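A single-view illustration of decoding by Dirichlet-energy gradient flow as described in the entity-alignment abstract above: discretizing the flow gives an iterative feature-propagation update over a normalized adjacency matrix, after which entities are matched by nearest neighbour. TFP's three views (entity-to-entity, entity-to-relation, relation-to-entity) are collapsed into one here for brevity, so this is a sketch of the idea rather than the method itself.

import numpy as np

def normalized_adjacency(edges, n):
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    deg = np.maximum(A.sum(1), 1.0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return D_inv_sqrt @ A @ D_inv_sqrt

def propagate(X, A_hat, steps=10, alpha=0.5):
    # Each step is a discretized gradient-flow update that lowers Dirichlet energy.
    for _ in range(steps):
        X = (1 - alpha) * X + alpha * (A_hat @ X)
    return X

def align(X1, X2):
    # Nearest-neighbour matching on cosine similarity after propagation.
    X1 = X1 / np.linalg.norm(X1, axis=1, keepdims=True)
    X2 = X2 / np.linalg.norm(X2, axis=1, keepdims=True)
    return np.argmax(X1 @ X2.T, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb1, emb2 = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
    edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
    A_hat = normalized_adjacency(edges, 5)
    print("matches:", align(propagate(emb1, A_hat), propagate(emb2, A_hat)))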
Notably, the approach achieves these advancements with less\nthan 6 seconds of additional computational time, establishing a new benchmark\nin efficiency and adaptability for future EA methods.\n","authors":["Yuanyi Wang","Haifeng Sun","Jingyu Wang","Qi Qi","Shaoling Sun","Jianxin Liao"],"pdf_url":"https://arxiv.org/pdf/2401.12798v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12794v1","updated":"2024-01-23T14:29:17Z","published":"2024-01-23T14:29:17Z","title":"Benchmarking LLMs via Uncertainty Quantification","summary":" The proliferation of open-source Large Language Models (LLMs) from various\ninstitutions has highlighted the urgent need for comprehensive evaluation\nmethods. However, current evaluation platforms, such as the widely recognized\nHuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty,\nwhich is vital for thoroughly assessing LLMs. To bridge this gap, we introduce\na new benchmarking approach for LLMs that integrates uncertainty\nquantification. Our examination involves eight LLMs (LLM series) spanning five\nrepresentative natural language processing tasks. Additionally, we introduce an\nuncertainty-aware evaluation metric, UAcc, which takes into account both\nprediction accuracy and prediction uncertainty. Our findings reveal that: I)\nLLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs\nmay display greater uncertainty compared to their smaller counterparts; and\nIII) Instruction-finetuning tends to increase the uncertainty of LLMs. By\ntaking uncertainty into account, our new UAcc metric can either amplify or\ndiminish the relative improvement of one LLM over another and may even change\nthe relative ranking of two LLMs. These results underscore the significance of\nincorporating uncertainty in the evaluation of LLMs.\n","authors":["Fanghua Ye","Mingming Yang","Jianhui Pang","Longyue Wang","Derek F. Wong","Emine Yilmaz","Shuming Shi","Zhaopeng Tu"],"pdf_url":"https://arxiv.org/pdf/2401.12794v1.pdf","comment":"24 pages, preprints"},{"id":"http://arxiv.org/abs/2401.12789v1","updated":"2024-01-23T14:19:01Z","published":"2024-01-23T14:19:01Z","title":"Multilingual and Fully Non-Autoregressive ASR with Large Language Model\n Fusion: A Comprehensive Study","summary":" In the era of large models, the autoregressive nature of decoding often\nresults in latency serving as a significant bottleneck. We propose a\nnon-autoregressive LM-fused ASR system that effectively leverages the\nparallelization capabilities of accelerator hardware. Our approach combines the\nUniversal Speech Model (USM) and the PaLM 2 language model in per-segment\nscoring mode, achieving an average relative WER improvement across all\nlanguages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our\ncomprehensive ablation study analyzes key parameters such as LLM size, context\nlength, vocabulary size, fusion methodology. For instance, we explore the\nimpact of LLM size ranging from 128M to 340B parameters on ASR performance.\nThis study provides valuable insights into the factors influencing the\neffectiveness of practical large-scale LM-fused speech recognition systems.\n","authors":["W. Ronny Huang","Cyril Allauzen","Tongzhou Chen","Kilol Gupta","Ke Hu","James Qin","Yu Zhang","Yongqiang Wang","Shuo-Yiin Chang","Tara N. 
Sainath"],"pdf_url":"https://arxiv.org/pdf/2401.12789v1.pdf","comment":"ICASSP 2024"},{"id":"http://arxiv.org/abs/2308.12890v3","updated":"2024-01-23T13:42:03Z","published":"2023-08-24T16:09:13Z","title":"Large Language Models Vote: Prompting for Rare Disease Identification","summary":" The emergence of generative Large Language Models (LLMs) emphasizes the need\nfor accurate and efficient prompting approaches. LLMs are often applied in\nFew-Shot Learning (FSL) contexts, where tasks are executed with minimal\ntraining data. FSL has become popular in many Artificial Intelligence (AI)\nsubdomains, including AI for health. Rare diseases affect a small fraction of\nthe population. Rare disease identification from clinical notes inherently\nrequires FSL techniques due to limited data availability. Manual data\ncollection and annotation is both expensive and time-consuming. In this paper,\nwe propose Models-Vote Prompting (MVP), a flexible prompting approach for\nimproving the performance of LLM queries in FSL settings. MVP works by\nprompting numerous LLMs to perform the same tasks and then conducting a\nmajority vote on the resulting outputs. This method achieves improved results\nto any one model in the ensemble on one-shot rare disease identification and\nclassification tasks. We also release a novel rare disease dataset for FSL,\navailable to those who signed the MIMIC-IV Data Use Agreement (DUA).\nFurthermore, in using MVP, each model is prompted multiple times, substantially\nincreasing the time needed for manual annotation, and to address this, we\nassess the feasibility of using JSON for automating generative LLM evaluation.\n","authors":["David Oniani","Jordan Hilsman","Hang Dong","Fengyi Gao","Shiven Verma","Yanshan Wang"],"pdf_url":"https://arxiv.org/pdf/2308.12890v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12756v1","updated":"2024-01-23T13:35:47Z","published":"2024-01-23T13:35:47Z","title":"What the Weight?! A Unified Framework for Zero-Shot Knowledge\n Composition","summary":" The knowledge encapsulated in a model is the core factor determining its\nfinal performance on downstream tasks. Much research in NLP has focused on\nefficient methods for storing and adapting different types of knowledge, e.g.,\nin dedicated modularized structures, and on how to effectively combine these,\ne.g., by learning additional parameters. However, given the many possible\noptions, a thorough understanding of the mechanisms involved in these\ncompositions is missing, and hence it remains unclear which strategies to\nutilize. To address this research gap, we propose a novel framework for\nzero-shot module composition, which encompasses existing and some novel\nvariations for selecting, weighting, and combining parameter modules under a\nsingle unified notion. Focusing on the scenario of domain knowledge and adapter\nlayers, our framework provides a systematic unification of concepts, allowing\nus to conduct the first comprehensive benchmarking study of various zero-shot\nknowledge composition strategies. In particular, we test two module combination\nmethods and five selection and weighting strategies for their effectiveness and\nefficiency in an extensive experimental setup. Our results highlight the\nefficacy of ensembling but also hint at the power of simple though\noften-ignored weighting methods. Further in-depth analyses allow us to\nunderstand the role of weighting vs. 
top-k selection, and show that, to a\ncertain extent, the performance of adapter composition can even be predicted.\n","authors":["Carolin Holtermann","Markus Frohmann","Navid Rekabsaz","Anne Lauscher"],"pdf_url":"https://arxiv.org/pdf/2401.12756v1.pdf","comment":"Accepted to Findings of the ACL: EACL 2024"},{"id":"http://arxiv.org/abs/2401.08517v2","updated":"2024-01-23T13:29:20Z","published":"2024-01-16T17:31:35Z","title":"Supporting Student Decisions on Learning Recommendations: An LLM-Based\n Chatbot with Knowledge Graph Contextualization for Conversational\n Explainability and Mentoring","summary":" Student commitment towards a learning recommendation is not separable from\ntheir understanding of the reasons it was recommended to them; and their\nability to modify it based on that understanding. Among explainability\napproaches, chatbots offer the potential to engage the student in a\nconversation, similar to a discussion with a peer or a mentor. The capabilities\nof chatbots, however, are still not sufficient to replace a human mentor,\ndespite the advancements of generative AI (GenAI) and large language models\n(LLM). Therefore, we propose an approach to utilize chatbots as mediators of\nthe conversation and sources of limited and controlled generation of\nexplanations, to harvest the potential of LLMs while reducing their potential\nrisks at the same time. The proposed LLM-based chatbot supports students in\nunderstanding learning-paths recommendations. We use a knowledge graph (KG) as\na human-curated source of information, to regulate the LLM's output through\ndefining its prompt's context. A group chat approach is developed to connect\nstudents with human mentors, either on demand or in cases that exceed the\nchatbot's pre-defined tasks. We evaluate the chatbot with a user study, to\nprovide a proof-of-concept and highlight the potential requirements and\nlimitations of utilizing chatbots in conversational explainability.\n","authors":["Hasan Abu-Rasheed","Mohamad Hussam Abdulsalam","Christian Weber","Madjid Fathi"],"pdf_url":"https://arxiv.org/pdf/2401.08517v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.07913v4","updated":"2024-01-23T13:26:56Z","published":"2023-12-13T06:11:42Z","title":"A Survey of Text Watermarking in the Era of Large Language Models","summary":" Text watermarking algorithms play a crucial role in the copyright protection\nof textual content, yet their capabilities and application scenarios have been\nlimited historically. The recent developments in large language models (LLMs)\nhave opened new opportunities for the advancement of text watermarking\ntechniques. LLMs not only enhance the capabilities of text watermarking\nalgorithms through their text understanding and generation abilities but also\nnecessitate the use of text watermarking algorithms for their own copyright\nprotection. This paper conducts a comprehensive survey of the current state of\ntext watermarking technology, covering four main aspects: (1) an overview and\ncomparison of different text watermarking techniques; (2) evaluation methods\nfor text watermarking algorithms, including their success rates, impact on text\nquality, robustness, and unforgeability; (3) potential application scenarios\nfor text watermarking technology; (4) current challenges and future directions\nfor development. 
This survey aims to provide researchers with a thorough\nunderstanding of text watermarking technology, thereby promoting its further\nadvancement.\n","authors":["Aiwei Liu","Leyi Pan","Yijian Lu","Jingjing Li","Xuming Hu","Xi Zhang","Lijie Wen","Irwin King","Hui Xiong","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2312.07913v4.pdf","comment":"35 pages, 7 figures"},{"id":"http://arxiv.org/abs/2401.12720v1","updated":"2024-01-23T12:41:03Z","published":"2024-01-23T12:41:03Z","title":"A Comprehensive View of the Biases of Toxicity and Sentiment Analysis\n Methods Towards Utterances with African American English Expressions","summary":" Language is a dynamic aspect of our culture that changes when expressed in\ndifferent technologies/communities. Online social networks have enabled the\ndiffusion and evolution of different dialects, including African American\nEnglish (AAE). However, this increased usage is not without barriers. One\nparticular barrier is how sentiment (Vader, TextBlob, and Flair) and toxicity\n(Google's Perspective and the open-source Detoxify) methods present biases\ntowards utterances with AAE expressions. Consider Google's Perspective to\nunderstand bias. Here, an utterance such as ``All n*ggers deserve to die\nrespectfully. The police murder us.'' it reaches a higher toxicity than\n``African-Americans deserve to die respectfully. The police murder us.''. This\nscore difference likely arises because the tool cannot understand the\nre-appropriation of the term ``n*gger''. One explanation for this bias is that\nAI models are trained on limited datasets, and using such a term in training\ndata is more likely to appear in a toxic utterance. While this may be\nplausible, the tool will make mistakes regardless. Here, we study bias on two\nWeb-based (YouTube and Twitter) datasets and two spoken English datasets. Our\nanalysis shows how most models present biases towards AAE in most settings. We\nisolate the impact of AAE expression usage via linguistic control features from\nthe Linguistic Inquiry and Word Count (LIWC) software, grammatical control\nfeatures extracted via Part-of-Speech (PoS) tagging from Natural Language\nProcessing (NLP) models, and the semantic of utterances by comparing sentence\nembeddings from recent language models. We present consistent results on how a\nheavy usage of AAE expressions may cause the speaker to be considered\nsubstantially more toxic, even when speaking about nearly the same subject. Our\nstudy complements similar analyses focusing on small datasets and/or one method\nonly.\n","authors":["Guilherme H. Resende","Luiz F. Nery","Fabrício Benevenuto","Savvas Zannettou","Flavio Figueiredo"],"pdf_url":"https://arxiv.org/pdf/2401.12720v1.pdf","comment":"Under peer review"},{"id":"http://arxiv.org/abs/2401.12713v1","updated":"2024-01-23T12:29:37Z","published":"2024-01-23T12:29:37Z","title":"Generating Unsupervised Abstractive Explanations for Rumour Verification","summary":" The task of rumour verification in social media concerns assessing the\nveracity of a claim on the basis of conversation threads that result from it.\nWhile previous work has focused on predicting a veracity label, here we\nreformulate the task to generate model-centric, free-text explanations of a\nrumour's veracity. We follow an unsupervised approach by first utilising\npost-hoc explainability methods to score the most important posts within a\nthread and then we use these posts to generate informative explanatory\nsummaries by employing template-guided summarisation. 
To evaluate the\ninformativeness of the explanatory summaries, we exploit the few-shot learning\ncapabilities of a large language model (LLM). Our experiments show that LLMs\ncan have similar agreement to humans in evaluating summaries. Importantly, we\nshow that explanatory abstractive summaries are more informative and better\nreflect the predicted rumour veracity than just using the highest ranking posts\nin the thread.\n","authors":["Iman Munire Bilal","Preslav Nakov","Rob Procter","Maria Liakata"],"pdf_url":"https://arxiv.org/pdf/2401.12713v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12689v1","updated":"2024-01-23T11:54:09Z","published":"2024-01-23T11:54:09Z","title":"Energy-based Automated Model Evaluation","summary":" The conventional evaluation protocols on machine learning models rely heavily\non a labeled, i.i.d-assumed testing dataset, which is not often present in real\nworld applications. The Automated Model Evaluation (AutoEval) shows an\nalternative to this traditional workflow, by forming a proximal prediction\npipeline of the testing performance without the presence of ground-truth\nlabels. Despite its recent successes, the AutoEval frameworks still suffer from\nan overconfidence issue, substantial storage and computational cost. In that\nregard, we propose a novel measure -- Meta-Distribution Energy (MDE) -- that\nallows the AutoEval framework to be both more efficient and effective. The core\nof the MDE is to establish a meta-distribution statistic, on the information\n(energy) associated with individual samples, then offer a smoother\nrepresentation enabled by energy-based learning. We further provide our\ntheoretical insights by connecting the MDE with the classification loss. We\nprovide extensive experiments across modalities, datasets and different\narchitectural backbones to validate MDE's validity, together with its\nsuperiority compared with prior approaches. We also prove MDE's versatility by\nshowing its seamless integration with large-scale models, and easy adaption to\nlearning scenarios with noisy- or imbalanced- labels.\n","authors":["Ru Peng","Heming Zou","Haobo Wang","Yawen Zeng","Zenan Huang","Junbo Zhao"],"pdf_url":"https://arxiv.org/pdf/2401.12689v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12671v1","updated":"2024-01-23T11:25:34Z","published":"2024-01-23T11:25:34Z","title":"Context Matters: Pushing the Boundaries of Open-Ended Answer Generation\n with Graph-Structured Knowledge Context","summary":" In the continuously advancing AI landscape, crafting context-rich and\nmeaningful responses via Large Language Models (LLMs) is essential. Researchers\nare becoming more aware of the challenges that LLMs with fewer parameters\nencounter when trying to provide suitable answers to open-ended questions. To\naddress these hurdles, the integration of cutting-edge strategies, augmentation\nof rich external domain knowledge to LLMs, offers significant improvements.\nThis paper introduces a novel framework that combines graph-driven context\nretrieval in conjunction to knowledge graphs based enhancement, honing the\nproficiency of LLMs, especially in domain specific community question answering\nplatforms like AskUbuntu, Unix, and ServerFault. 
We conduct experiments on\nvarious LLMs with different parameter sizes to evaluate their ability to ground\nknowledge and determine factual accuracy in answers to open-ended questions.\nOur methodology GraphContextGen consistently outperforms dominant text-based\nretrieval systems, demonstrating its robustness and adaptability to a larger\nnumber of use cases. This advancement highlights the importance of pairing\ncontext rich data retrieval with LLMs, offering a renewed approach to knowledge\nsourcing and generation in AI systems. We also show that, due to rich\ncontextual data retrieval, the crucial entities, along with the generated\nanswer, remain factually coherent with the gold answer.\n","authors":["Somnath Banerjee","Amruit Sahoo","Sayan Layek","Avik Dutta","Rima Hazra","Animesh Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2401.12671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12631v1","updated":"2024-01-23T10:27:42Z","published":"2024-01-23T10:27:42Z","title":"A Reply to Makelov et al. (2023)'s \"Interpretability Illusion\" Arguments","summary":" We respond to the recent paper by Makelov et al. (2023), which reviews\nsubspace interchange intervention methods like distributed alignment search\n(DAS; Geiger et al. 2023) and claims that these methods potentially cause\n\"interpretability illusions\". We first review Makelov et al. (2023)'s technical\nnotion of what an \"interpretability illusion\" is, and then we show that even\nintuitive and desirable explanations can qualify as illusions in this sense. As\na result, their method of discovering \"illusions\" can reject explanations they\nconsider \"non-illusory\". We then argue that the illusions Makelov et al. (2023)\nsee in practice are artifacts of their training and evaluation paradigms. We\nclose by emphasizing that, though we disagree with their core characterization,\nMakelov et al. (2023)'s examples and discussion have undoubtedly pushed the\nfield of interpretability forward.\n","authors":["Zhengxuan Wu","Atticus Geiger","Jing Huang","Aryaman Arora","Thomas Icard","Christopher Potts","Noah D. Goodman"],"pdf_url":"https://arxiv.org/pdf/2401.12631v1.pdf","comment":"20 pages, 14 figures"},{"id":"http://arxiv.org/abs/2401.11969v2","updated":"2024-01-23T09:35:02Z","published":"2024-01-22T14:17:03Z","title":"Claim Detection for Automated Fact-checking: A Survey on Monolingual,\n Multilingual and Cross-Lingual Research","summary":" Automated fact-checking has drawn considerable attention over the past few\ndecades due to the increase in the diffusion of misinformation on online\nplatforms. This is often carried out as a sequence of tasks comprising (i) the\ndetection of sentences circulating in online platforms which constitute claims\nneeding verification, followed by (ii) the verification process of those\nclaims. This survey focuses on the former, by discussing existing efforts\ntowards detecting claims needing fact-checking, with a particular focus on\nmultilingual data and methods. This is a challenging and fertile direction\nwhere existing methods are yet far from matching human performance due to the\nprofoundly challenging nature of the issue. Especially, the dissemination of\ninformation across multiple social platforms, articulated in multiple languages\nand modalities demands more generalized solutions for combating misinformation.\nFocusing on multilingual misinformation, we present a comprehensive survey of\nexisting multilingual claim detection research. 
We present state-of-the-art\nmultilingual claim detection research categorized into three key factors of the\nproblem, verifiability, priority, and similarity. Further, we present a\ndetailed overview of the existing multilingual datasets along with the\nchallenges and suggest possible future advancements.\n","authors":["Rrubaa Panchendrarajan","Arkaitz Zubiaga"],"pdf_url":"https://arxiv.org/pdf/2401.11969v2.pdf","comment":"Typo corrected"},{"id":"http://arxiv.org/abs/2401.12585v1","updated":"2024-01-23T09:33:31Z","published":"2024-01-23T09:33:31Z","title":"SLANG: New Concept Comprehension of Large Language Models","summary":" The dynamic nature of language, particularly evident in the realm of slang\nand memes on the Internet, poses serious challenges to the adaptability of\nlarge language models (LLMs). Traditionally anchored to static datasets, these\nmodels often struggle to keep up with the rapid linguistic evolution\ncharacteristic of online communities. This research addresses the critical need\nto bridge this gap, aiming to enhance LLMs' comprehension of evolving new\nconcepts on the internet, without the high cost and impracticality of continual\nretraining. To address this issue, we propose a new benchmark $\\textbf{SLANG}$\nto assess LLMs' proficiency in comprehending emerging linguistic trends and a\nbaseline approach $\\textbf{FOCUS}$, which uses causal inference to enhance LLMs\nto understand new phrases and usage patterns. This approach involves\nscrutinizing real-world instances of linguistic shifts, serving as contextual\nbeacons, to form more precise and contextually relevant connections between\nnewly emerging expressions and their intended meanings. The empirical analysis\nshows that our causal inference-based approach outperforms the traditional\nmodels in terms of precision and relevance in the interpretation of Internet\nslang and memes.\n","authors":["Lingrui Mei","Shenghua Liu","Yiwei Wang","Baolong Bi","Xueqi Chen"],"pdf_url":"https://arxiv.org/pdf/2401.12585v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.01185v3","updated":"2024-01-23T09:16:00Z","published":"2023-12-02T17:24:17Z","title":"A ripple in time: a discontinuity in American history","summary":" In this note we use the State of the Union Address (SOTU) dataset from Kaggle\nto make some surprising (and some not so surprising) observations pertaining to\nthe general timeline of American history, and the character and nature of the\naddresses themselves. Our main approach is using vector embeddings, such as\nBERT (DistilBERT) and GPT-2.\n While it is widely believed that BERT (and its variations) is most suitable\nfor NLP classification tasks, we find out that GPT-2 in conjunction with\nnonlinear dimension reduction methods such as UMAP provide better separation\nand stronger clustering. This makes GPT-2 + UMAP an interesting alternative. In\nour case, no model fine-tuning is required, and the pre-trained out-of-the-box\nGPT-2 model is enough.\n We also used a fine-tuned DistilBERT model for classification detecting which\nPresident delivered which address, with very good results (accuracy 93% - 95%\ndepending on the run). 
An analogous task was performed to determine the year of\nwriting, and we were able to pin it down to about 4 years (which is a single\npresidential term).\n It is worth noting that SOTU addresses provide relatively small writing\nsamples (with about 8'000 words on average, and varying widely from under 2'000\nwords to more than 20'000), and that the number of authors is relatively large\n(we used SOTU addresses of 42 US presidents). This shows that the techniques\nemployed turn out to be rather efficient, while all the computations described\nin this note can be performed using a single GPU instance of Google Colab.\n The accompanying code is available on GitHub.\n","authors":["Alexander Kolpakov","Igor Rivin"],"pdf_url":"https://arxiv.org/pdf/2312.01185v3.pdf","comment":"7 pages, 8 figures; GitHub repository\n https://github.com/sashakolpakov/ripple_in_time"},{"id":"http://arxiv.org/abs/2401.12576v1","updated":"2024-01-23T09:11:07Z","published":"2024-01-23T09:11:07Z","title":"LLMCheckup: Conversational Examination of Large Language Models via\n Interpretability Tools","summary":" Interpretability tools that offer explanations in the form of a dialogue have\ndemonstrated their efficacy in enhancing users' understanding, as one-off\nexplanations may occasionally fall short in providing sufficient information to\nthe user. Current solutions for dialogue-based explanations, however, require\nmany dependencies and are not easily transferable to tasks they were not\ndesigned for. With LLMCheckup, we present an easily accessible tool that allows\nusers to chat with any state-of-the-art large language model (LLM) about its\nbehavior. We enable LLMs to generate all explanations by themselves and take\ncare of intent recognition without fine-tuning, by connecting them with a broad\nspectrum of Explainable AI (XAI) tools, e.g. feature attributions,\nembedding-based similarity, and prompting strategies for counterfactual and\nrationale generation. LLM (self-)explanations are presented as an interactive\ndialogue that supports follow-up questions and generates suggestions.\nLLMCheckup provides tutorials for operations available in the system, catering\nto individuals with varying levels of expertise in XAI and supports multiple\ninput modalities. We introduce a new parsing strategy called multi-prompt\nparsing substantially enhancing the parsing accuracy of LLMs. Finally, we\nshowcase the tasks of fact checking and commonsense question answering.\n","authors":["Qianli Wang","Tatiana Anikina","Nils Feldhus","Josef van Genabith","Leonhard Hennig","Sebastian Möller"],"pdf_url":"https://arxiv.org/pdf/2401.12576v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.12192v2","updated":"2024-01-23T09:05:41Z","published":"2022-12-23T08:23:32Z","title":"Learning to Generate Questions by Enhancing Text Generation with\n Sentence Selection","summary":" We introduce an approach for the answer-aware question generation problem.\nInstead of only relying on the capability of strong pre-trained language\nmodels, we observe that the information of answers and questions can be found\nin some relevant sentences in the context. Based on that, we design a model\nwhich includes two modules: a selector and a generator. The selector forces the\nmodel to more focus on relevant sentences regarding an answer to provide\nimplicit local information. The generator generates questions by implicitly\ncombining local information from the selector and global information from the\nwhole context encoded by the encoder. 
The model is trained jointly to take\nadvantage of latent interactions between the two modules. Experimental results\non two benchmark datasets show that our model is better than strong pre-trained\nmodels for the question generation task. The code is also available.\n","authors":["Pham Quoc-Hung","Minh-Tien Nguyen","Manh Tran-Tien","Hung Le","Xuan-Hieu Phan"],"pdf_url":"https://arxiv.org/pdf/2212.12192v2.pdf","comment":"This paper describes an on-going work"},{"id":"http://arxiv.org/abs/2401.06827v2","updated":"2024-01-23T08:54:15Z","published":"2024-01-12T04:54:01Z","title":"APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning","summary":" Pre-trained Vision-Language (V-L) models set the benchmark for generalization\nto downstream tasks among the noteworthy contenders. Many characteristics of\nthe V-L model have been explored in existing research including the challenge\nof the sensitivity to text input and the tuning process across multi-modal\nprompts. With the advanced utilization of the V-L model like CLIP, recent\napproaches deploy learnable prompts instead of hand-craft prompts to boost the\ngeneralization performance and address the aforementioned challenges. Inspired\nby layer-wise training, which is wildly used in image fusion, we note that\nusing a sequential training process to adapt different modalities branches of\nCLIP efficiently facilitates the improvement of generalization. In the context\nof addressing the multi-modal prompting challenge, we propose Token-wise\nAdaptive for Multi-modal Prompt Learning (APLe) for tuning both modalities\nprompts, vision and language, as tokens in a sequential manner. APLe addresses\nthe challenges in V-L models to promote prompt learning across both modalities,\nwhich indicates a competitive generalization performance in line with the\nstate-of-the-art. Preeminently, APLe shows robustness and favourable\nperformance in prompt-length experiments with an absolute advantage in adopting\nthe V-L models.\n","authors":["Guiming Cao","Kaize Shi","Hong Fu","Huaiwen Zhang","Guandong Xu"],"pdf_url":"https://arxiv.org/pdf/2401.06827v2.pdf","comment":"7 pages,3 figures"},{"id":"http://arxiv.org/abs/2401.12566v1","updated":"2024-01-23T08:49:23Z","published":"2024-01-23T08:49:23Z","title":"Automated Fact-Checking of Climate Change Claims with Large Language\n Models","summary":" This paper presents Climinator, a novel AI-based tool designed to automate\nthe fact-checking of climate change claims. Utilizing an array of Large\nLanguage Models (LLMs) informed by authoritative sources like the IPCC reports\nand peer-reviewed scientific literature, Climinator employs an innovative\nMediator-Advocate framework. This design allows Climinator to effectively\nsynthesize varying scientific perspectives, leading to robust, evidence-based\nevaluations. Our model demonstrates remarkable accuracy when testing claims\ncollected from Climate Feedback and Skeptical Science. Notably, when\nintegrating an advocate with a climate science denial perspective in our\nframework, Climinator's iterative debate process reliably converges towards\nscientific consensus, underscoring its adeptness at reconciling diverse\nviewpoints into science-based, factual conclusions. While our research is\nsubject to certain limitations and necessitates careful interpretation, our\napproach holds significant potential. 
We hope to stimulate further research and\nencourage exploring its applicability in other contexts, including political\nfact-checking and legal domains.\n","authors":["Markus Leippold","Saeid Ashraf Vaghefi","Dominik Stammbach","Veruska Muccione","Julia Bingler","Jingwei Ni","Chiara Colesanti-Senni","Tobias Wekhof","Tobias Schimanski","Glen Gostlow","Tingyu Yu","Juerg Luterbacher","Christian Huggel"],"pdf_url":"https://arxiv.org/pdf/2401.12566v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.03025v2","updated":"2024-01-23T07:49:13Z","published":"2023-10-04T17:59:41Z","title":"Retrieval meets Long Context Large Language Models","summary":" Extending the context window of large language models (LLMs) is getting\npopular recently, while the solution of augmenting LLMs with retrieval has\nexisted for years. The natural questions are: i) Retrieval-augmentation versus\nlong context window, which one is better for downstream tasks? ii) Can both\nmethods be combined to get the best of both worlds? In this work, we answer\nthese questions by studying both solutions using two state-of-the-art\npretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps\nsurprisingly, we find that LLM with 4K context window using simple\nretrieval-augmentation at generation can achieve comparable performance to\nfinetuned LLM with 16K context window via positional interpolation on long\ncontext tasks, while taking much less computation. More importantly, we\ndemonstrate that retrieval can significantly improve the performance of LLMs\nregardless of their extended context window sizes. Our best model,\nretrieval-augmented Llama2-70B with 32K context window, outperforms\nGPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context\ntasks including question answering, query-based summarization, and in-context\nfew-shot learning tasks. It also outperforms its non-retrieval Llama2-70B-32k\nbaseline by a margin, while being much faster at generation. Our study provides\ngeneral insights on the choice of retrieval-augmentation versus long context\nextension of LLM for practitioners.\n","authors":["Peng Xu","Wei Ping","Xianchao Wu","Lawrence McAfee","Chen Zhu","Zihan Liu","Sandeep Subramanian","Evelina Bakhturina","Mohammad Shoeybi","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2310.03025v2.pdf","comment":"Published at ICLR 2024"},{"id":"http://arxiv.org/abs/2401.12540v1","updated":"2024-01-23T07:48:58Z","published":"2024-01-23T07:48:58Z","title":"DREditor: An Time-efficient Approach for Building a Domain-specific\n Dense Retrieval Model","summary":" Deploying dense retrieval models efficiently is becoming increasingly\nimportant across various industries. This is especially true for enterprise\nsearch services, where customizing search engines to meet the time demands of\ndifferent enterprises in different domains is crucial. Motivated by this, we\ndevelop a time-efficient approach called DREditor to edit the matching rule of\nan off-the-shelf dense retrieval model to suit a specific domain. This is\nachieved by directly calibrating the output embeddings of the model using an\nefficient and effective linear mapping. This mapping is powered by an edit\noperator that is obtained by solving a specially constructed least squares\nproblem. Compared to implicit rule modification via long-time finetuning, our\nexperimental results show that DREditor provides significant advantages on\ndifferent domain-specific datasets, dataset sources, retrieval models, and\ncomputing devices. 
It consistently enhances time efficiency by 100-300 times\nwhile maintaining comparable or even superior retrieval performance. In a\nbroader context, we take the first step to introduce a novel embedding\ncalibration approach for the retrieval task, filling the technical blank in the\ncurrent field of embedding calibration. This approach also paves the way for\nbuilding domain-specific dense retrieval models efficiently and inexpensively.\n","authors":["Chen Huang","Duanyu Feng","Wenqiang Lei","Jiancheng Lv"],"pdf_url":"https://arxiv.org/pdf/2401.12540v1.pdf","comment":"15 pages, 6 figures, Codes are available at\n https://github.com/huangzichun/DREditor"},{"id":"http://arxiv.org/abs/2401.10134v2","updated":"2024-01-23T07:42:40Z","published":"2024-01-18T17:03:59Z","title":"Spatial-Temporal Large Language Model for Traffic Prediction","summary":" Traffic prediction, a critical component for intelligent transportation\nsystems, endeavors to foresee future traffic at specific locations using\nhistorical data. Although existing traffic prediction models often emphasize\ndeveloping complex neural network structures, their accuracy has not seen\nimprovements accordingly. Recently, Large Language Models (LLMs) have shown\noutstanding capabilities in time series analysis. Differing from existing\nmodels, LLMs progress mainly through parameter expansion and extensive\npre-training while maintaining their fundamental structures. In this paper, we\npropose a Spatial-Temporal Large Language Model (ST-LLM) for traffic\nprediction. Specifically, ST-LLM redefines the timesteps at each location as\ntokens and incorporates a spatial-temporal embedding module to learn the\nspatial location and global temporal representations of tokens. Then these\nrepresentations are fused to provide each token with unified spatial and\ntemporal information. Furthermore, we propose a novel partially frozen\nattention strategy of the LLM, which is designed to capture spatial-temporal\ndependencies for traffic prediction. Comprehensive experiments on real traffic\ndatasets offer evidence that ST-LLM outperforms state-of-the-art models.\nNotably, the ST-LLM also exhibits robust performance in both few-shot and\nzero-shot prediction scenarios.\n","authors":["Chenxi Liu","Sun Yang","Qianxiong Xu","Zhishuai Li","Cheng Long","Ziyue Li","Rui Zhao"],"pdf_url":"https://arxiv.org/pdf/2401.10134v2.pdf","comment":"Revise"},{"id":"http://arxiv.org/abs/2311.12373v2","updated":"2024-01-23T07:12:01Z","published":"2023-11-21T06:23:38Z","title":"Beyond Turing: A Comparative Analysis of Approaches for Detecting\n Machine-Generated Text","summary":" Significant progress has been made on text generation by pre-trained language\nmodels (PLMs), yet distinguishing between human and machine-generated text\nposes an escalating challenge. This paper offers an in-depth evaluation of\nthree distinct methods used to address this task: traditional shallow learning,\nLanguage Model (LM) fine-tuning, and Multilingual Model fine-tuning. These\napproaches are rigorously tested on a wide range of machine-generated texts,\nproviding a benchmark of their competence in distinguishing between\nhuman-authored and machine-authored linguistic constructs. The results reveal\nconsiderable differences in performance across methods, thus emphasizing the\ncontinued need for advancement in this crucial area of NLP. 
This study offers\nvaluable insights and paves the way for future research aimed at creating\nrobust and highly discriminative models.\n","authors":["Muhammad Farid Adilazuarda"],"pdf_url":"https://arxiv.org/pdf/2311.12373v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.04867v2","updated":"2024-01-23T06:48:45Z","published":"2024-01-10T01:02:26Z","title":"An Analysis of User Behaviors for Objectively Evaluating Spoken Dialogue\n Systems","summary":" Establishing evaluation schemes for spoken dialogue systems is important, but\nit can also be challenging. While subjective evaluations are commonly used in\nuser experiments, objective evaluations are necessary for research comparison\nand reproducibility. To address this issue, we propose a framework for\nindirectly but objectively evaluating systems based on users' behaviors. In\nthis paper, to this end, we investigate the relationship between user behaviors\nand subjective evaluation scores in social dialogue tasks: attentive listening,\njob interview, and first-meeting conversation. The results reveal that in\ndialogue tasks where user utterances are primary, such as attentive listening\nand job interview, indicators like the number of utterances and words play a\nsignificant role in evaluation. Observing disfluency also can indicate the\neffectiveness of formal tasks, such as job interview. On the other hand, in\ndialogue tasks with high interactivity, such as first-meeting conversation,\nbehaviors related to turn-taking, like average switch pause length, become more\nimportant. These findings suggest that selecting appropriate user behaviors can\nprovide valuable insights for objective evaluation in each social dialogue\ntask.\n","authors":["Koji Inoue","Divesh Lala","Keiko Ochi","Tatsuya Kawahara","Gabriel Skantze"],"pdf_url":"https://arxiv.org/pdf/2401.04867v2.pdf","comment":"This paper has been accepted for presentation at International\n Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and\n represents the author's version of the work"},{"id":"http://arxiv.org/abs/2401.12522v1","updated":"2024-01-23T06:36:49Z","published":"2024-01-23T06:36:49Z","title":"BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language\n Models","summary":" Large language models (LLMs) commonly employ autoregressive generation during\ninference, leading to high memory bandwidth demand and consequently extended\nlatency. To mitigate this inefficiency, we present Bi-directional Tuning for\nlossless Acceleration (BiTA), an innovative method expediting LLMs via\nstreamlined semi-autoregressive generation and draft verification. Inspired by\nthe concept of prompt tuning, we enhance LLMs with a parameter-efficient design\ncalled bi-directional tuning for the capability in semi-autoregressive\ngeneration. Employing efficient tree-based decoding, the models perform draft\ncandidate generation and verification in parallel, ensuring outputs identical\nto their autoregressive counterparts under greedy sampling. BiTA serves as a\nlightweight plug-in module, seamlessly boosting the inference efficiency of\nexisting LLMs without requiring additional assistance models or incurring\nsignificant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat\nachieves a 2.7$\\times$ speedup on the MT-Bench benchmark. 
Extensive experiments\nconfirm our method surpasses state-of-the-art acceleration techniques.\n","authors":["Feng Lin","Hanling Yi","Hongbin Li","Yifan Yang","Xiaotian Yu","Guangming Lu","Rong Xiao"],"pdf_url":"https://arxiv.org/pdf/2401.12522v1.pdf","comment":"Source code at https://github.com/linfeng93/BiTA"},{"id":"http://arxiv.org/abs/2401.12520v1","updated":"2024-01-23T06:30:05Z","published":"2024-01-23T06:30:05Z","title":"Key Information Retrieval to Classify the Unstructured Data Content of\n Preferential Trade Agreements","summary":" With the rapid proliferation of textual data, predicting long texts has\nemerged as a significant challenge in the domain of natural language\nprocessing. Traditional text prediction methods encounter substantial\ndifficulties when grappling with long texts, primarily due to the presence of\nredundant and irrelevant information, which impedes the model's capacity to\ncapture pivotal insights from the text. To address this issue, we introduce a\nnovel approach to long-text classification and prediction. Initially, we employ\nembedding techniques to condense the long texts, aiming to diminish the\nredundancy therein. Subsequently,the Bidirectional Encoder Representations from\nTransformers (BERT) embedding method is utilized for text classification\ntraining. Experimental outcomes indicate that our method realizes considerable\nperformance enhancements in classifying long texts of Preferential Trade\nAgreements. Furthermore, the condensation of text through embedding methods not\nonly augments prediction accuracy but also substantially reduces computational\ncomplexity. Overall, this paper presents a strategy for long-text prediction,\noffering a valuable reference for researchers and engineers in the natural\nlanguage processing sphere.\n","authors":["Jiahui Zhao","Ziyi Meng","Stepan Gordeev","Zijie Pan","Dongjin Song","Sandro Steinbach","Caiwen Ding"],"pdf_url":"https://arxiv.org/pdf/2401.12520v1.pdf","comment":"AI4TS Workshop@AAAI 2024 accepted publication"},{"id":"http://arxiv.org/abs/2401.12492v1","updated":"2024-01-23T05:20:35Z","published":"2024-01-23T05:20:35Z","title":"Comparing Human-Centered Language Modeling: Is it Better to Model\n Groups, Individual Traits, or Both?","summary":" Natural language processing has made progress in incorporating human context\ninto its models, but whether it is more effective to use group-wise attributes\n(e.g., over-45-year-olds) or model individuals remains open. Group attributes\nare technically easier but coarse: not all 45-year-olds write the same way. In\ncontrast, modeling individuals captures the complexity of each person's\nidentity. It allows for a more personalized representation, but we may have to\nmodel an infinite number of users and require data that may be impossible to\nget. We compare modeling human context via group attributes, individual users,\nand combined approaches. Combining group and individual features significantly\nbenefits user-level regression tasks like age estimation or personality\nassessment from a user's documents. Modeling individual users significantly\nimproves the performance of single document-level classification tasks like\nstance and topic detection. We also find that individual-user modeling does\nwell even without user's historical data.\n","authors":["Nikita Soni","Niranjan Balasubramanian","H. 
Andrew Schwartz","Dirk Hovy"],"pdf_url":"https://arxiv.org/pdf/2401.12492v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12491v1","updated":"2024-01-23T05:19:47Z","published":"2024-01-23T05:19:47Z","title":"Assessing and Understanding Creativity in Large Language Models","summary":" In the field of natural language processing, the rapid development of large\nlanguage model (LLM) has attracted more and more attention. LLMs have shown a\nhigh level of creativity in various tasks, but the methods for assessing such\ncreativity are inadequate. The assessment of LLM creativity needs to consider\ndifferences from humans, requiring multi-dimensional measurement while\nbalancing accuracy and efficiency. This paper aims to establish an efficient\nframework for assessing the level of creativity in LLMs. By adapting the\nmodified Torrance Tests of Creative Thinking, the research evaluates the\ncreative performance of various LLMs across 7 tasks, emphasizing 4 criteria\nincluding Fluency, Flexibility, Originality, and Elaboration. In this context,\nwe develop a comprehensive dataset of 700 questions for testing and an\nLLM-based evaluation method. In addition, this study presents a novel analysis\nof LLMs' responses to diverse prompts and role-play situations. We found that\nthe creativity of LLMs primarily falls short in originality, while excelling in\nelaboration. Besides, the use of prompts and the role-play settings of the\nmodel significantly influence creativity. Additionally, the experimental\nresults also indicate that collaboration among multiple LLMs can enhance\noriginality. Notably, our findings reveal a consensus between human evaluations\nand LLMs regarding the personality traits that influence creativity. The\nfindings underscore the significant impact of LLM design on creativity and\nbridges artificial intelligence and human creativity, offering insights into\nLLMs' creativity and potential applications.\n","authors":["Yunpu Zhao","Rui Zhang","Wenyi Li","Di Huang","Jiaming Guo","Shaohui Peng","Yifan Hao","Yuanbo Wen","Xing Hu","Zidong Du","Qi Guo","Ling Li","Yunji Chen"],"pdf_url":"https://arxiv.org/pdf/2401.12491v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.10225v2","updated":"2024-01-23T05:04:32Z","published":"2024-01-18T18:59:11Z","title":"ChatQA: Building GPT-4 Level Conversational QA Models","summary":" In this work, we introduce ChatQA, a family of conversational question\nanswering (QA) models that obtain GPT-4 level accuracies. Specifically, we\npropose a two-stage instruction tuning method that can significantly improve\nthe zero-shot conversational QA results from large language models (LLMs). To\nhandle retrieval-augmented generation in conversational QA, we fine-tune a\ndense retriever on a multi-turn QA dataset, which provides comparable results\nto using the state-of-the-art query rewriting model while largely reducing\ndeployment cost. Notably, our ChatQA-70B can outperform GPT-4 in terms of\naverage score on 10 conversational QA datasets (54.14 vs. 
53.90), without\nrelying on any synthetic data from OpenAI GPT models.\n","authors":["Zihan Liu","Wei Ping","Rajarshi Roy","Peng Xu","Chankyu Lee","Mohammad Shoeybi","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2401.10225v2.pdf","comment":"We added ChatQA-22B results"},{"id":"http://arxiv.org/abs/2305.09781v3","updated":"2024-01-23T05:02:03Z","published":"2023-05-16T20:12:59Z","title":"SpecInfer: Accelerating Generative Large Language Model Serving with\n Tree-based Speculative Inference and Verification","summary":" This paper introduces SpecInfer, a system that accelerates generative large\nlanguage model (LLM) serving with tree-based speculative inference and\nverification. The key idea behind SpecInfer is leveraging small speculative\nmodels to predict the LLM's outputs; the predictions are organized as a token\ntree, whose nodes each represent a candidate token sequence. The correctness of\nall candidate token sequences represented by a token tree is verified against\nthe LLM in parallel using a novel tree-based parallel decoding mechanism.\nSpecInfer uses an LLM as a token tree verifier instead of an incremental\ndecoder, which significantly reduces the end-to-end latency and computational\nrequirement for serving generative LLMs while provably preserving model\nquality. Our evaluation shows that SpecInfer outperforms existing LLM serving\nsystems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for\noffloading-based LLM inference, while preserving the same generative\nperformance. SpecInfer is publicly available at\nhttps://github.com/flexflow/FlexFlow/\n","authors":["Xupeng Miao","Gabriele Oliaro","Zhihao Zhang","Xinhao Cheng","Zeyu Wang","Zhengxin Zhang","Rae Ying Yee Wong","Alan Zhu","Lijie Yang","Xiaoxiang Shi","Chunan Shi","Zhuoming Chen","Daiyaan Arfeen","Reyna Abhyankar","Zhihao Jia"],"pdf_url":"https://arxiv.org/pdf/2305.09781v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.07213v2","updated":"2024-01-23T04:59:29Z","published":"2023-08-14T15:31:32Z","title":"Human-centered NLP Fact-checking: Co-Designing with Fact-checkers using\n Matchmaking for AI","summary":" While many Natural Language Processing (NLP) techniques have been proposed\nfor fact-checking, both academic research and fact-checking organizations\nreport limited adoption of such NLP work due to poor alignment with\nfact-checker practices, values, and needs. To address this, we investigate a\nco-design method, Matchmaking for AI, to enable fact-checkers, designers, and\nNLP researchers to collaboratively identify what fact-checker needs should be\naddressed by technology, and to brainstorm ideas for potential solutions.\nCo-design sessions we conducted with 22 professional fact-checkers yielded a\nset of 11 design ideas that offer a \"north star\", integrating fact-checker\ncriteria into novel NLP design concepts. These concepts range from pre-bunking\nmisinformation, efficient and personalized monitoring misinformation,\nproactively reducing fact-checker potential biases, and collaborative writing\nfact-check reports. 
Our work provides new insights into both human-centered\nfact-checking research and practice and AI co-design research.\n","authors":["Houjiang Liu","Anubrata Das","Alexander Boltz","Didi Zhou","Daisy Pinaroc","Matthew Lease","Min Kyung Lee"],"pdf_url":"https://arxiv.org/pdf/2308.07213v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.02994v3","updated":"2024-01-23T04:43:56Z","published":"2024-01-04T07:45:49Z","title":"Blending Is All You Need: Cheaper, Better Alternative to\n Trillion-Parameters LLM","summary":" In conversational AI research, there's a noticeable trend towards developing\nmodels with a larger number of parameters, exemplified by models like ChatGPT.\nWhile these expansive models tend to generate increasingly better chat\nresponses, they demand significant computational resources and memory. This\nstudy explores a pertinent question: Can a combination of smaller models\ncollaboratively achieve comparable or enhanced performance relative to a\nsingular large model? We introduce an approach termed \"blending\", a\nstraightforward yet effective method of integrating multiple chat AIs. Our\nempirical evidence suggests that when specific smaller models are\nsynergistically blended, they can potentially outperform or match the\ncapabilities of much larger counterparts. For instance, integrating just three\nmodels of moderate size (6B/13B paramaeters) can rival or even surpass the\nperformance metrics of a substantially larger model like ChatGPT (175B+\nparamaters). This hypothesis is rigorously tested using A/B testing\nmethodologies with a large user base on the Chai research platform over a span\nof thirty days. The findings underscore the potential of the \"blending\"\nstrategy as a viable approach for enhancing chat AI efficacy without a\ncorresponding surge in computational demands.\n","authors":["Xiaoding Lu","Zongyi Liu","Adian Liusie","Vyas Raina","Vineet Mudupalli","Yuwen Zhang","William Beauchamp"],"pdf_url":"https://arxiv.org/pdf/2401.02994v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12474v1","updated":"2024-01-23T03:56:22Z","published":"2024-01-23T03:56:22Z","title":"Large Language Models are Superpositions of All Characters: Attaining\n Arbitrary Role-play via Self-Alignment","summary":" Considerable efforts have been invested in augmenting the role-playing\nproficiency of open-source large language models (LLMs) by emulating\nproprietary counterparts. Nevertheless, we posit that LLMs inherently harbor\nrole-play capabilities, owing to the extensive knowledge of characters and\npotential dialogues ingrained in their vast training corpora. Thus, in this\nstudy, we introduce Ditto, a self-alignment method for role-play. Ditto\ncapitalizes on character knowledge, encouraging an instruction-following LLM to\nsimulate role-play dialogues as a variant of reading comprehension. This method\ncreates a role-play training set comprising 4,000 characters, surpassing the\nscale of currently available datasets by tenfold regarding the number of roles.\nSubsequently, we fine-tune the LLM using this self-generated dataset to augment\nits role-playing capabilities. Upon evaluating our meticulously constructed and\nreproducible role-play benchmark and the roleplay subset of MT-Bench, Ditto, in\nvarious parameter scales, consistently maintains a consistent role identity and\nprovides accurate role-specific knowledge in multi-turn role-play\nconversations. 
Notably, it outperforms all open-source role-play baselines,\nshowcasing performance levels comparable to advanced proprietary chatbots.\nFurthermore, we present the first comprehensive cross-supervision alignment\nexperiment in the role-play domain, revealing that the intrinsic capabilities\nof LLMs confine the knowledge within role-play. Meanwhile, the role-play styles\ncan be easily acquired with the guidance of smaller models. We open-source\nrelated resources at https://github.com/OFA-Sys/Ditto.\n","authors":["Keming Lu","Bowen Yu","Chang Zhou","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2401.12474v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12472v1","updated":"2024-01-23T03:47:07Z","published":"2024-01-23T03:47:07Z","title":"Contrastive Learning in Distilled Models","summary":" Natural Language Processing models like BERT can provide state-of-the-art\nword embeddings for downstream NLP tasks. However, these models yet to perform\nwell on Semantic Textual Similarity, and may be too large to be deployed as\nlightweight edge applications. We seek to apply a suitable contrastive learning\nmethod based on the SimCSE paper, to a model architecture adapted from a\nknowledge distillation based model, DistilBERT, to address these two issues.\nOur final lightweight model DistilFace achieves an average of 72.1 in\nSpearman's correlation on STS tasks, a 34.2 percent improvement over BERT base.\n","authors":["Valerie Lim","Kai Wen Ng","Kenneth Lim"],"pdf_url":"https://arxiv.org/pdf/2401.12472v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11624v2","updated":"2024-01-23T03:35:40Z","published":"2024-01-21T23:34:42Z","title":"In-context Learning with Retrieved Demonstrations for Language Models: A\n Survey","summary":" Language models, especially pre-trained large language models, have showcased\nremarkable abilities as few-shot in-context learners (ICL), adept at adapting\nto new tasks with just a few demonstrations in the input context. However, the\nmodel's ability to perform ICL is sensitive to the choice of the few-shot\ndemonstrations. Instead of using a fixed set of demonstrations, one recent\ndevelopment is to retrieve demonstrations tailored to each input query. The\nimplementation of demonstration retrieval is relatively straightforward,\nleveraging existing databases and retrieval systems. This not only improves the\nefficiency and scalability of the learning process but also has been shown to\nreduce biases inherent in manual example selection. In light of the encouraging\nresults and growing research in ICL with retrieved demonstrations, we conduct\nan extensive review of studies in this area. In this survey, we discuss and\ncompare different design choices for retrieval models, retrieval training\nprocedures, and inference algorithms.\n","authors":["Man Luo","Xin Xu","Yue Liu","Panupong Pasupat","Mehran Kazemi"],"pdf_url":"https://arxiv.org/pdf/2401.11624v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11033v2","updated":"2024-01-23T03:30:11Z","published":"2024-01-19T21:21:02Z","title":"FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for\n Large Language Models' Training?","summary":" The rapid evolution of Large Language Models (LLMs) underscores the critical\nimportance of ethical considerations and data integrity in AI development,\nemphasizing the role of FAIR (Findable, Accessible, Interoperable, Reusable)\ndata principles. 
While these principles have long been a cornerstone of ethical\ndata stewardship, their application in LLM training data is less prevalent, an\nissue our research aims to address. Our study begins with a review of existing\nliterature, highlighting the significance of FAIR principles in data management\nfor model training. Building on this foundation, we introduce a novel framework\nthat incorporates FAIR principles into the LLM training process. A key aspect\nof this approach is a comprehensive checklist, designed to assist researchers\nand developers in consistently applying FAIR data principles throughout the\nmodel development lifecycle. The practicality and effectiveness of our\nframework are demonstrated through a case study that involves creating a\nFAIR-compliant dataset to detect and reduce biases. This case study not only\nvalidates the usefulness of our framework but also establishes new benchmarks\nfor more equitable, transparent, and ethical practices in LLM training. We\noffer this framework to the community as a means to promote technologically\nadvanced, ethically sound, and socially responsible AI models.\n","authors":["Shaina Raza","Shardul Ghuge","Chen Ding","Deval Pandya"],"pdf_url":"https://arxiv.org/pdf/2401.11033v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04691v5","updated":"2024-01-23T03:25:22Z","published":"2023-10-07T05:37:41Z","title":"EMO: Earth Mover Distance Optimization for Auto-Regressive Language\n Modeling","summary":" Neural language models are probabilistic models of human text. They are\npredominantly trained using maximum likelihood estimation (MLE), which is\nequivalent to minimizing the forward cross-entropy between the empirical data\ndistribution and the model distribution. However, various degeneration\nphenomena are still widely observed when decoding from the distributions\nlearned by such models. We establish that the forward cross-entropy is\nsuboptimal as a distance metric for aligning human and model distribution due\nto its (1) recall-prioritization (2) negative diversity ignorance and (3)\ntrain-test mismatch. In this paper, we propose Earth Mover Distance\nOptimization (EMO) for auto-regressive language modeling. EMO capitalizes on\nthe inherent properties of earth mover distance to address the aforementioned\nchallenges. Due to the high complexity of direct computation, we further\nintroduce a feasible upper bound for EMO to ease end-to-end training. Upon\nextensive evaluation of language models trained using EMO and MLE. We find that\nEMO demonstrates a consistently better language modeling performance than MLE\nacross domains. Moreover, EMO demonstrates noteworthy enhancements in\ndownstream performance with minimal fine-tuning on merely 25,000 sentences.\nThis highlights the tremendous potential of EMO as a lightweight calibration\nmethod for enhancing large-scale pre-trained language models.\n","authors":["Siyu Ren","Zhiyong Wu","Kenny Q. Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.04691v5.pdf","comment":"To appear at ICLR 2024"},{"id":"http://arxiv.org/abs/2401.12461v1","updated":"2024-01-23T03:03:57Z","published":"2024-01-23T03:03:57Z","title":"Fast Adversarial Training against Textual Adversarial Attacks","summary":" Many adversarial defense methods have been proposed to enhance the\nadversarial robustness of natural language processing models. 
However, most of\nthem introduce additional pre-set linguistic knowledge and assume that the\nsynonym candidates used by attackers are accessible, which is an ideal\nassumption. We delve into adversarial training in the embedding space and\npropose a Fast Adversarial Training (FAT) method to improve the model\nrobustness in the synonym-unaware scenario from the perspective of single-step\nperturbation generation and perturbation initialization. Based on the\nobservation that the adversarial perturbations crafted by single-step and\nmulti-step gradient ascent are similar, FAT uses single-step gradient ascent to\ncraft adversarial examples in the embedding space to expedite the training\nprocess. Based on the observation that the perturbations generated on the\nidentical training sample in successive epochs are similar, FAT fully utilizes\nhistorical information when initializing the perturbation. Extensive\nexperiments demonstrate that FAT significantly boosts the robustness of BERT\nmodels in the synonym-unaware scenario, and outperforms the defense baselines\nunder various attacks with character-level and word-level modifications.\n","authors":["Yichen Yang","Xin Liu","Kun He"],"pdf_url":"https://arxiv.org/pdf/2401.12461v1.pdf","comment":"4 pages, 4 figures"},{"id":"http://arxiv.org/abs/2309.09552v3","updated":"2024-01-23T02:59:44Z","published":"2023-09-18T08:03:54Z","title":"A Multitask Training Approach to Enhance Whisper with Contextual Biasing\n and Open-Vocabulary Keyword Spotting","summary":" End-to-end automatic speech recognition (ASR) systems often struggle to\nrecognize rare name entities, such as personal names, organizations, and\nterminologies not frequently encountered in the training data. This paper\npresents Contextual Biasing Whisper (CB-Whisper), a novel ASR system based on\nOpenAI's Whisper model that can recognize user-defined name entities by\nperforming open-vocabulary keyword-spotting (OV-KWS) using the hidden states of\nWhisper encoder. The recognized entities are used as prompts for the Whisper\ndecoder. We first propose a multitask training approach with OV-KWS and ASR\ntasks to optimize the model. Experiments show that this approach substantially\nimproves the entity recalls compared to the original Whisper model on Chinese\nAishell hot word subsets and two internal code-switch test sets. However, we\nobserved a slight increase in mixed-error-rate (MER) on internal test sets due\nto catastrophic forgetting. To address this problem and use different sizes of\nthe Whisper model without finetuning, we propose to use OV-KWS as a separate\nmodule and construct a spoken form prompt to prevent hallucination. The OV-KWS\nmodule consistently improves MER and Entity Recall for whisper-small, medium,\nand large models.\n","authors":["Yuang Li","Yinglu Li","Min Zhang","Chang Su","Mengxin Ren","Xiaosong Qiao","Xiaofeng Zhao","Mengyao Piao","Jiawei Yu","Xinglin Lv","Miaomiao Ma","Yanqing Zhao","Hao Yang"],"pdf_url":"https://arxiv.org/pdf/2309.09552v3.pdf","comment":"5 pages, 2 figures"},{"id":"http://arxiv.org/abs/2305.02317v3","updated":"2024-01-23T02:29:35Z","published":"2023-05-03T17:58:29Z","title":"Visual Chain of Thought: Bridging Logical Gaps with Multimodal\n Infillings","summary":" Recent advances in large language models elicit reasoning in a\nchain-of-thought that allows models to decompose problems in a human-like\nfashion. 
Though this paradigm improves multi-step reasoning ability in language\nmodels, it is limited by being unimodal and applied mainly to\nquestion-answering tasks. We claim that incorporating visual augmentation into\nreasoning is essential, especially for complex, imaginative tasks.\nConsequently, we introduce VCoT, a novel method that leverages chain-of-thought\nprompting with vision-language grounding to recursively bridge the logical gaps\nwithin sequential data. Our method uses visual guidance to generate synthetic\nmultimodal infillings that add consistent and novel information to reduce the\nlogical gaps for downstream tasks that can benefit from temporal reasoning, as\nwell as provide interpretability into models' multi-step reasoning. We apply\nVCoT to the Visual Storytelling and WikiHow summarization datasets and\ndemonstrate through human evaluation that VCoT offers novel and consistent\nsynthetic data augmentation beating chain-of-thought baselines, which can be\nused to enhance downstream performance.\n","authors":["Daniel Rose","Vaishnavi Himakunthala","Andy Ouyang","Ryan He","Alex Mei","Yujie Lu","Michael Saxon","Chinmay Sonar","Diba Mirza","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2305.02317v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.13010v2","updated":"2024-01-23T02:12:35Z","published":"2023-12-20T13:22:41Z","title":"AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and\n Optimisation","summary":" The advancement of natural language processing (NLP) has been significantly\nboosted by the development of transformer-based large language models (LLMs).\nThese models have revolutionized NLP tasks, particularly in code generation,\naiding developers in creating software with enhanced efficiency. Despite their\nadvancements, challenges in balancing code snippet generation with effective\ntest case generation and execution persist. To address these issues, this paper\nintroduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution\ncomprising a multi-agent framework with specialized agents: the programmer\nagent, the test designer agent, and the test executor agent. During the coding\nprocedure, the programmer agent will focus on the code generation and\nrefinement based on the test executor agent's feedback. The test designer agent\nwill generate test cases for the generated code, and the test executor agent\nwill run the code with the test cases and write the feedback to the programmer.\nThis collaborative system ensures robust code generation, surpassing the\nlimitations of single-agent models and traditional methodologies. Our extensive\nexperiments on 9 code generation models and 12 enhancement approaches showcase\nAgentCoder's superior performance over existing code generation models and\nprompt engineering techniques across various benchmarks. For example,\nAgentCoder achieves 77.4% and 89.1% pass@1 in HumanEval-ET and MBPP-ET with\nGPT-3.5, while SOTA baselines obtain only 69.5% and 63.0%.\n","authors":["Dong Huang","Qingwen Bu","Jie M. Zhang","Michael Luck","Heming Cui"],"pdf_url":"https://arxiv.org/pdf/2312.13010v2.pdf","comment":"21 pages, 12 figures"},{"id":"http://arxiv.org/abs/2308.16692v2","updated":"2024-01-23T01:56:57Z","published":"2023-08-31T12:53:09Z","title":"SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language\n Models","summary":" Current speech large language models build upon discrete speech\nrepresentations, which can be categorized into semantic tokens and acoustic\ntokens. 
However, existing speech tokens are not specifically designed for\nspeech language modeling. To assess the suitability of speech tokens for\nbuilding speech language models, we established the first benchmark,\nSLMTokBench. Our results indicate that neither semantic nor acoustic tokens are\nideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech\ntokenizer for speech large language models. SpeechTokenizer adopts the\nEncoder-Decoder architecture with residual vector quantization (RVQ). Unifying\nsemantic and acoustic tokens, SpeechTokenizer disentangles different aspects of\nspeech information hierarchically across different RVQ layers. Furthermore, We\nconstruct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer.\nExperiments show that SpeechTokenizer performs comparably to EnCodec in speech\nreconstruction and demonstrates strong performance on the SLMTokBench\nbenchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks.\nCode and models are available at\nhttps://github.com/ZhangXInFD/SpeechTokenizer/.\n","authors":["Xin Zhang","Dong Zhang","Shimin Li","Yaqian Zhou","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2308.16692v2.pdf","comment":"Accepted by ICLR 2024. Project page is at\n https://0nutation.github.io/SpeechTokenizer.github.io/"},{"id":"http://arxiv.org/abs/2401.12428v1","updated":"2024-01-23T01:33:09Z","published":"2024-01-23T01:33:09Z","title":"CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory\n Accelerators","summary":" In recent years, various computing-in-memory (CIM) processors have been\npresented, showing superior performance over traditional architectures. To\nunleash the potential of various CIM architectures, such as device precision,\ncrossbar size, and crossbar number, it is necessary to develop compilation\ntools that are fully aware of the CIM architectural details and implementation\ndiversity. However, due to the lack of architectural support in current popular\nopen-source compiling stacks, existing CIM designs either manually deploy\nnetworks or build their own compilers, which is time-consuming and\nlabor-intensive. Although some works expose the specific CIM device programming\ninterfaces to compilers, they are often bound to a fixed CIM architecture,\nlacking the flexibility to support the CIM architectures with different\ncomputing granularity. On the other hand, existing compilation works usually\nconsider the scheduling of limited operation types (such as crossbar-bound\nmatrix-vector multiplication). Unlike conventional processors, CIM accelerators\nare featured by their diverse architecture, circuit, and device, which cannot\nbe simply abstracted by a single level if we seek to fully explore the\nadvantages brought by CIM. Therefore, we propose CIM-MLC, a universal\nmulti-level compilation framework for general CIM architectures. We first\nestablish a general hardware abstraction for CIM architectures and computing\nmodes to represent various CIM accelerators. Based on the proposed abstraction,\nCIM-MLC can compile tasks onto a wide range of CIM accelerators having\ndifferent devices, architectures, and programming interfaces. 
More importantly,\ncompared with existing compilation work, CIM-MLC can explore the mapping and\nscheduling strategies across multiple architectural tiers, which form a\ntractable yet effective design space, to achieve better scheduling and\ninstruction generation results.\n","authors":["Songyun Qu","Shixin Zhao","Bing Li","Yintao He","Xuyi Cai","Lei Zhang","Ying Wang"],"pdf_url":"https://arxiv.org/pdf/2401.12428v1.pdf","comment":"16 pages, 22 figures"},{"id":"http://arxiv.org/abs/2401.12425v1","updated":"2024-01-23T01:25:00Z","published":"2024-01-23T01:25:00Z","title":"The Neglected Tails of Vision-Language Models","summary":" Vision-language models (VLMs) excel in zero-shot recognition but exhibit\ndrastically imbalanced performance across visual concepts. For example, CLIP,\ndespite an impressive mean zero-shot accuracy on ImageNet (72.7%), yields\n$<$10% on ten concepts (e.g., gyromitra and night snake), presumably, because\nthese concepts are under-represented in VLMs' imbalanced pretraining data. Yet,\nassessing this imbalance is challenging as it is non-trivial to calculate the\nfrequency of specific concepts within VLMs' large-scale pretraining data. Our\nwork makes the first attempt to measure the concept frequency by analyzing\npretraining texts. We use off-the-shelf language models to help count relevant\ntexts that contain synonyms of the given concepts and resolve linguistic\nambiguity. We confirm that popular VLM datasets like LAION indeed exhibit\nlong-tailed concept distributions, which strongly correlate with per-class\naccuracies. Further, contemporary multimodal systems, e.g., visual chatbots and\ntext-to-image generators, also struggle with the rare concepts identified by\nour method. To mitigate VLMs' imbalanced performance in zero-shot recognition,\nwe propose REtrieval-Augmented Learning REAL. First, instead of prompting VLMs\nusing the original class names, REAL uses their most frequent synonyms found in\nVLMs' pretraining texts. This already outperforms human-engineered and\nLLM-generated prompts over nine benchmark datasets, likely because VLMs have\nseen more images associated with the frequently used synonyms. Second, REAL\nuses all the concept synonyms to retrieve a small, class-balanced set of\npretraining data to train a robust classifier. REAL surpasses the recent\nretrieval-augmented solution REACT, using 400x less storage and 10,000x less\ntraining time!\n","authors":["Shubham Parashar","Zhiqiu Lin","Tian Liu","Xiangjue Dong","Yanan Li","Deva Ramanan","James Caverlee","Shu Kong"],"pdf_url":"https://arxiv.org/pdf/2401.12425v1.pdf","comment":"Project Page:\n https://shubhamprshr27.github.io/neglected-tails-of-vlms/"},{"id":"http://arxiv.org/abs/2401.13146v1","updated":"2024-01-23T23:46:01Z","published":"2024-01-23T23:46:01Z","title":"Locality enhanced dynamic biasing and sampling strategies for contextual\n ASR","summary":" Automatic Speech Recognition (ASR) still faces challenges when recognizing\ntime-variant rare phrases. Contextual biasing (CB) modules bias the ASR model\ntowards such contextually-relevant phrases. During training, a list of biasing\nphrases is selected from a large pool of phrases following a sampling\nstrategy. In this work, we first analyse different sampling strategies to\nprovide insights into the training of CB for ASR with correlation plots between\nthe bias embeddings among various training stages. 
Secondly, we introduce a\nneighbourhood attention (NA) that localizes self attention (SA) to the nearest\nneighbouring frames to further refine the CB output. The results show that this\nproposed approach provides on average a 25.84% relative WER improvement on\nLibriSpeech sets and rare-word evaluation compared to the baseline.\n","authors":["Md Asif Jalal","Pablo Peso Parada","George Pavlidis","Vasileios Moschopoulos","Karthikeyan Saravanan","Chrysovalantis-Giorgos Kontoulis","Jisi Zhang","Anastasios Drosou","Gil Ho Lee","Jungin Lee","Seokyeong Jung"],"pdf_url":"https://arxiv.org/pdf/2401.13146v1.pdf","comment":"Accepted for IEEE ASRU 2023"},{"id":"http://arxiv.org/abs/2309.06657v2","updated":"2024-01-23T23:16:11Z","published":"2023-09-13T01:07:25Z","title":"Statistical Rejection Sampling Improves Preference Optimization","summary":" Improving the alignment of language models with human preferences remains an\nactive research challenge. Previous approaches have primarily utilized\nReinforcement Learning from Human Feedback (RLHF) via online RL methods such as\nProximal Policy Optimization (PPO). Recently, offline methods such as Sequence\nLikelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have\nemerged as attractive alternatives, offering improvements in stability and\nscalability while maintaining competitive performance. SLiC refines its loss\nfunction using sequence pairs sampled from a supervised fine-tuned (SFT)\npolicy, while DPO directly optimizes language models based on preference data,\nforegoing the need for a separate reward model. However, the maximum likelihood\nestimator (MLE) of the target optimal policy requires labeled preference pairs\nsampled from that policy. DPO's lack of a reward model constrains its ability\nto sample preference pairs from the optimal policy, and SLiC is restricted to\nsampling preference pairs only from the SFT policy. To address these\nlimitations, we introduce a novel approach called Statistical Rejection\nSampling Optimization (RSO) that aims to source preference data from the target\noptimal policy using rejection sampling, enabling a more accurate estimation of\nthe optimal policy. We also propose a unified framework that enhances the loss\nfunctions used in both SLiC and DPO from a preference modeling standpoint.\nThrough extensive experiments across three diverse tasks, we demonstrate that\nRSO consistently outperforms both SLiC and DPO on evaluations from both Large\nLanguage Model (LLM) and human raters.\n","authors":["Tianqi Liu","Yao Zhao","Rishabh Joshi","Misha Khalman","Mohammad Saleh","Peter J. Liu","Jialu Liu"],"pdf_url":"https://arxiv.org/pdf/2309.06657v2.pdf","comment":"Accepted in ICLR 2024"},{"id":"http://arxiv.org/abs/2401.13136v1","updated":"2024-01-23T23:12:09Z","published":"2024-01-23T23:12:09Z","title":"The Language Barrier: Dissecting Safety Challenges of LLMs in\n Multilingual Contexts","summary":" As the influence of large language models (LLMs) spans across global\ncommunities, their safety challenges in multilingual settings become paramount\nfor alignment research. This paper examines the variations in safety challenges\nfaced by LLMs across different languages and discusses approaches to\nalleviating such concerns. By comparing how state-of-the-art LLMs respond to\nthe same set of malicious prompts written in higher- vs. 
lower-resource\nlanguages, we observe that (1) LLMs tend to generate unsafe responses much more\noften when a malicious prompt is written in a lower-resource language, and (2)\nLLMs tend to generate more irrelevant responses to malicious prompts in\nlower-resource languages. To understand where the discrepancy can be\nattributed, we study the effect of instruction tuning with reinforcement\nlearning from human feedback (RLHF) or supervised finetuning (SFT) on the\nHH-RLHF dataset. Surprisingly, while training with high-resource languages\nimproves model alignment, training in lower-resource languages yields minimal\nimprovement. This suggests that the bottleneck of cross-lingual alignment is\nrooted in the pretraining stage. Our findings highlight the challenges in\ncross-lingual LLM safety, and we hope they inform future research in this\ndirection.\n","authors":["Lingfeng Shen","Weiting Tan","Sihao Chen","Yunmo Chen","Jingyu Zhang","Haoran Xu","Boyuan Zheng","Philipp Koehn","Daniel Khashabi"],"pdf_url":"https://arxiv.org/pdf/2401.13136v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.13133v1","updated":"2024-01-23T22:49:19Z","published":"2024-01-23T22:49:19Z","title":"Analyzing COVID-19 Vaccination Sentiments in Nigerian Cyberspace:\n Insights from a Manually Annotated Twitter Dataset","summary":" Numerous successes have been achieved in combating the COVID-19 pandemic,\ninitially using various precautionary measures like lockdowns, social\ndistancing, and the use of face masks. More recently, various vaccinations have\nbeen developed to aid in the prevention or reduction of the severity of the\nCOVID-19 infection. Despite the effectiveness of the precautionary measures and\nthe vaccines, there are several controversies that are massively shared on\nsocial media platforms like Twitter. In this paper, we explore the use of\nstate-of-the-art transformer-based language models to study people's acceptance\nof vaccines in Nigeria. We developed a novel dataset by crawling multi-lingual\ntweets using relevant hashtags and keywords. Our analysis and visualizations\nrevealed that most tweets expressed neutral sentiments about COVID-19 vaccines,\nwith some individuals expressing positive views, and there was no strong\npreference for specific vaccine types, although Moderna received slightly more\npositive sentiment. We also found out that fine-tuning a pre-trained LLM with\nan appropriate dataset can yield competitive results, even if the LLM was not\ninitially pre-trained on the specific language of that dataset.\n","authors":["Ibrahim Said Ahmad","Lukman Jibril Aliyu","Abubakar Auwal Khalid","Saminu Muhammad Aliyu","Shamsuddeen Hassan Muhammad","Idris Abdulmumin","Bala Mairiga Abduljalil","Bello Shehu Bello","Amina Imam Abubakar"],"pdf_url":"https://arxiv.org/pdf/2401.13133v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.06373v2","updated":"2024-01-23T22:46:12Z","published":"2024-01-12T16:13:24Z","title":"How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to\n Challenge AI Safety by Humanizing LLMs","summary":" Most traditional AI safety research has approached AI models as machines and\ncentered on algorithm-focused attacks developed by security experts. As large\nlanguage models (LLMs) become increasingly common and competent, non-expert\nusers can also impose risks during daily interactions. 
This paper introduces a\nnew perspective to jailbreak LLMs as human-like communicators, to explore this\noverlooked intersection between everyday language interaction and AI safety.\nSpecifically, we study how to persuade LLMs to jailbreak them. First, we\npropose a persuasion taxonomy derived from decades of social science research.\nThen, we apply the taxonomy to automatically generate interpretable persuasive\nadversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion\nsignificantly increases the jailbreak performance across all risk categories:\nPAP consistently achieves an attack success rate of over $92\\%$ on Llama 2-7b\nChat, GPT-3.5, and GPT-4 in $10$ trials, surpassing recent algorithm-focused\nattacks. On the defense side, we explore various mechanisms against PAP, find\na significant gap in existing defenses, and advocate for more fundamental\nmitigation for highly interactive LLMs.\n","authors":["Yi Zeng","Hongpeng Lin","Jingwen Zhang","Diyi Yang","Ruoxi Jia","Weiyan Shi"],"pdf_url":"https://arxiv.org/pdf/2401.06373v2.pdf","comment":"14 pages of the main text, qualitative examples of jailbreaks may be\n harmful in nature"},{"id":"http://arxiv.org/abs/2401.13129v1","updated":"2024-01-23T22:36:03Z","published":"2024-01-23T22:36:03Z","title":"Seed-Guided Fine-Grained Entity Typing in Science and Engineering\n Domains","summary":" Accurately typing entity mentions from text segments is a fundamental task\nfor various natural language processing applications. Many previous approaches\nrely on massive human-annotated data to perform entity typing. Nevertheless,\ncollecting such data in highly specialized science and engineering domains\n(e.g., software engineering and security) can be time-consuming and costly,\nwithout mentioning the domain gaps between training and inference data if the\nmodel needs to be applied to confidential datasets. In this paper, we study the\ntask of seed-guided fine-grained entity typing in science and engineering\ndomains, which takes the name and a few seed entities for each entity type as\nthe only supervision and aims to classify new entity mentions into both seen\nand unseen types (i.e., those without seed entities). To solve this problem, we\npropose SEType which first enriches the weak supervision by finding more\nentities for each seen type from an unlabeled corpus using the contextualized\nrepresentations of pre-trained language models. It then matches the enriched\nentities to unlabeled text to get pseudo-labeled samples and trains a textual\nentailment model that can make inferences for both seen and unseen types.\nExtensive experiments on two datasets covering four domains demonstrate the\neffectiveness of SEType in comparison with various baselines.\n","authors":["Yu Zhang","Yunyi Zhang","Yanzhen Shen","Yu Deng","Lucian Popa","Larisa Shwartz","ChengXiang Zhai","Jiawei Han"],"pdf_url":"https://arxiv.org/pdf/2401.13129v1.pdf","comment":"9 pages; Accepted to AAAI 2024 (Code:\n https://github.com/yuzhimanhua/SEType)"},{"id":"http://arxiv.org/abs/2202.12312v2","updated":"2024-01-23T22:09:07Z","published":"2022-02-24T19:00:39Z","title":"Oolong: Investigating What Makes Transfer Learning Hard with Controlled\n Studies","summary":" When we transfer a pretrained language model to a new language, there are\nmany axes of variation that change at once. 
To disentangle the impact of\ndifferent factors like syntactic similarity and vocabulary similarity, we\npropose a set of controlled transfer studies: we systematically transform the\nlanguage of the GLUE benchmark, altering one axis of crosslingual variation at\na time, and then measure the resulting drops in a pretrained model's downstream\nperformance. We find that models can largely recover from syntactic-style\nshifts, but cannot recover from vocabulary misalignment and embedding matrix\nre-initialization, even with continued pretraining on 15 million tokens. %On\nthe other hand, transferring to a dataset with an unaligned vocabulary is\nextremely hard to recover from in the low-data regime. Moreover, good-quality\ntokenizers in the transfer language do not make vocabulary alignment easier.\nOur experiments provide insights into the factors of cross-lingual transfer\nthat researchers should most focus on when designing language transfer\nscenarios.\n","authors":["Zhengxuan Wu","Alex Tamkin","Isabel Papadimitriou"],"pdf_url":"https://arxiv.org/pdf/2202.12312v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2303.13716v2","updated":"2024-01-23T21:52:42Z","published":"2023-03-24T00:01:24Z","title":"ReCOGS: How Incidental Details of a Logical Form Overshadow an\n Evaluation of Semantic Interpretation","summary":" Compositional generalization benchmarks for semantic parsing seek to assess\nwhether models can accurately compute meanings for novel sentences, but\noperationalize this in terms of logical form (LF) prediction. This raises the\nconcern that semantically irrelevant details of the chosen LFs could shape\nmodel performance. We argue that this concern is realized for the COGS\nbenchmark. COGS poses generalization splits that appear impossible for\npresent-day models, which could be taken as an indictment of those models.\nHowever, we show that the negative results trace to incidental features of COGS\nLFs. Converting these LFs to semantically equivalent ones and factoring out\ncapabilities unrelated to semantic interpretation, we find that even baseline\nmodels get traction. A recent variable-free translation of COGS LFs suggests\nsimilar conclusions, but we observe this format is not semantically equivalent;\nit is incapable of accurately representing some COGS meanings. These findings\ninform our proposal for ReCOGS, a modified version of COGS that comes closer to\nassessing the target semantic capabilities while remaining very challenging.\nOverall, our results reaffirm the importance of compositional generalization\nand careful benchmark task design.\n","authors":["Zhengxuan Wu","Christopher D. Manning","Christopher Potts"],"pdf_url":"https://arxiv.org/pdf/2303.13716v2.pdf","comment":"TACL 2023"},{"id":"http://arxiv.org/abs/2310.02374v4","updated":"2024-01-23T21:27:14Z","published":"2023-10-03T18:54:10Z","title":"Conversational Health Agents: A Personalized LLM-Powered Agent Framework","summary":" Conversational Health Agents (CHAs) are interactive systems that provide\nhealthcare services, such as assistance and diagnosis. Current CHAs, especially\nthose utilizing Large Language Models (LLMs), primarily focus on conversation\naspects. However, they offer limited agent capabilities, specifically lacking\nmulti-step problem-solving, personalized conversations, and multimodal data\nanalysis. Our aim is to overcome these limitations. 
We propose openCHA, an\nopen-source LLM-powered framework, to empower conversational agents to generate\na personalized response for users' healthcare queries. This framework enables\ndevelopers to integrate external sources including data sources, knowledge\nbases, and analysis models, into their LLM-based solutions. openCHA includes an\norchestrator to plan and execute actions for gathering information from\nexternal sources, essential for formulating responses to user inquiries. It\nfacilitates knowledge acquisition, problem-solving capabilities, multilingual\nand multimodal conversations, and fosters interaction with various AI\nplatforms. We illustrate the framework's proficiency in handling complex\nhealthcare tasks via three demonstrations. Moreover, we release openCHA as open\nsource available to the community via GitHub.\n","authors":["Mahyar Abbasian","Iman Azimi","Amir M. Rahmani","Ramesh Jain"],"pdf_url":"https://arxiv.org/pdf/2310.02374v4.pdf","comment":"23 pages, 6 figures, 3 tables, journal paper"},{"id":"http://arxiv.org/abs/2305.08809v2","updated":"2024-01-23T21:25:20Z","published":"2023-05-15T17:15:40Z","title":"Interpretability at Scale: Identifying Causal Mechanisms in Alpaca","summary":" Obtaining human-interpretable explanations of large, general-purpose language\nmodels is an urgent goal for AI safety. However, it is just as important that\nour interpretability methods are faithful to the causal dynamics underlying\nmodel behavior and able to robustly generalize to unseen inputs. Distributed\nAlignment Search (DAS) is a powerful gradient descent method grounded in a\ntheory of causal abstraction that has uncovered perfect alignments between\ninterpretable symbolic algorithms and small deep learning models fine-tuned for\nspecific tasks. In the present paper, we scale DAS significantly by replacing\nthe remaining brute-force search steps with learned parameters -- an approach\nwe call Boundless DAS. This enables us to efficiently search for interpretable\ncausal structure in large language models while they follow instructions. We\napply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf,\nsolves a simple numerical reasoning problem. With Boundless DAS, we discover\nthat Alpaca does this by implementing a causal model with two interpretable\nboolean variables. Furthermore, we find that the alignment of neural\nrepresentations with these variables is robust to changes in inputs and\ninstructions. These findings mark a first step toward faithfully understanding\nthe inner-workings of our ever-growing and most widely deployed language\nmodels. Our tool is extensible to larger LLMs and is released publicly at\n`https://github.com/stanfordnlp/pyvene`.\n","authors":["Zhengxuan Wu","Atticus Geiger","Christopher Potts","Noah D. Goodman"],"pdf_url":"https://arxiv.org/pdf/2305.08809v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2401.13086v1","updated":"2024-01-23T20:55:49Z","published":"2024-01-23T20:55:49Z","title":"Towards Trustable Language Models: Investigating Information Quality of\n Large Language Models","summary":" Large language models (LLM) are generating information at a rapid pace,\nrequiring users to increasingly rely and trust the data. Despite remarkable\nadvances of LLM, Information generated by LLM is not completely trustworthy,\ndue to challenges in information quality. Specifically, integrity of\nInformation quality decreases due to unreliable, biased, tokenization during\npre-training of LLM. 
Moreover, this decrease in information quality has\nled to hallucination and fabricated information. Unreliable information can\nlead to flawed decisions in businesses, which impacts economic activity.\nIn this work, we introduce a novel mathematical evaluation of the information\nquality of LLMs; we furthermore analyze and highlight information quality\nchallenges and scaling laws to systematically scale language models.\n","authors":["Rick Rejeleene","Xiaowei Xu","John Talburt"],"pdf_url":"https://arxiv.org/pdf/2401.13086v1.pdf","comment":"31 pages"},{"id":"http://arxiv.org/abs/2306.08877v3","updated":"2024-01-23T20:55:48Z","published":"2023-06-15T06:21:44Z","title":"Linguistic Binding in Diffusion Models: Enhancing Attribute\n Correspondence through Attention Map Alignment","summary":" Text-conditioned image generation models often generate incorrect\nassociations between entities and their visual attributes. This reflects an\nimpaired mapping between linguistic binding of entities and modifiers in the\nprompt and visual binding of the corresponding elements in the generated image.\nAs one notable example, a query like \"a pink sunflower and a yellow flamingo\"\nmay incorrectly produce an image of a yellow sunflower and a pink flamingo. To\nremedy this issue, we propose SynGen, an approach which first syntactically\nanalyses the prompt to identify entities and their modifiers, and then uses a\nnovel loss function that encourages the cross-attention maps to agree with the\nlinguistic binding reflected by the syntax. Specifically, we encourage large\noverlap between attention maps of entities and their modifiers, and small\noverlap with other entities and modifier words. The loss is optimized during\ninference, without retraining or fine-tuning the model. Human evaluation on\nthree datasets, including one new and challenging set, demonstrates significant\nimprovements of SynGen compared with current state of the art methods. This\nwork highlights how making use of sentence structure during inference can\nefficiently and substantially improve the faithfulness of text-to-image\ngeneration.\n","authors":["Royi Rassin","Eran Hirsch","Daniel Glickman","Shauli Ravfogel","Yoav Goldberg","Gal Chechik"],"pdf_url":"https://arxiv.org/pdf/2306.08877v3.pdf","comment":"Accepted to NeurIPS 2023 (oral). Our code is publicly available at\n https://github.com/RoyiRa/Syntax-Guided-Generation"},{"id":"http://arxiv.org/abs/2401.13085v1","updated":"2024-01-23T20:54:40Z","published":"2024-01-23T20:54:40Z","title":"IndiText Boost: Text Augmentation for Low Resource India Languages","summary":" Text Augmentation is an important task for low-resource languages. It helps\ndeal with the problem of data scarcity. Through the years, much work has been\ndone on data augmentation for the English language. In contrast, far less work\nhas been done on Indian languages, even though data augmentation is precisely\nwhat is needed to address their data scarcity. In this work, we focus on\nimplementing techniques like Easy Data Augmentation, Back Translation,\nParaphrasing, Text Generation using LLMs, and Text Expansion using LLMs for\ntext classification on different languages. We focus on 6 Indian languages,\nnamely: Sindhi, Marathi, Hindi, Gujarati, Telugu, and Sanskrit. 
According to\nour knowledge, no such work exists for text augmentation on Indian languages.\nWe carry out binary as well as multi-class text classification to make our\nresults more comparable. We get surprising results as basic data augmentation\ntechniques surpass LLMs.\n","authors":["Onkar Litake","Niraj Yagnik","Shreyas Labhsetwar"],"pdf_url":"https://arxiv.org/pdf/2401.13085v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.15812v2","updated":"2024-01-23T20:44:17Z","published":"2023-08-30T07:35:32Z","title":"Peering Through Preferences: Unraveling Feedback Acquisition for\n Aligning Large Language Models","summary":" Aligning large language models (LLMs) with human values and intents\ncritically involves the use of human or AI feedback. While dense feedback\nannotations are expensive to acquire and integrate, sparse feedback presents a\nstructural design choice between ratings (e.g., score Response A on a scale of\n1-7) and rankings (e.g., is Response A better than Response B?). In this work,\nwe analyze the effect of this design choice for the alignment and evaluation of\nLLMs. We uncover an inconsistency problem wherein the preferences inferred from\nratings and rankings significantly disagree 60% for both human and AI\nannotators. Our subsequent analysis identifies various facets of annotator\nbiases that explain this phenomena, such as human annotators would rate denser\nresponses higher while preferring accuracy during pairwise judgments. To our\nsurprise, we also observe that the choice of feedback protocol also has a\nsignificant effect on the evaluation of aligned LLMs. In particular, we find\nthat LLMs that leverage rankings data for alignment (say model X) are preferred\nover those that leverage ratings data (say model Y), with a rank-based\nevaluation protocol (is X/Y's response better than reference response?) but not\nwith a rating-based evaluation protocol (score Rank X/Y's response on a scale\nof 1-7). Our findings thus shed light on critical gaps in methods for\nevaluating the real-world utility of language models and their strong\ndependence on the feedback protocol used for alignment. Our code and data are\navailable at https://github.com/Hritikbansal/sparse_feedback.\n","authors":["Hritik Bansal","John Dang","Aditya Grover"],"pdf_url":"https://arxiv.org/pdf/2308.15812v2.pdf","comment":"31 pages, Accepted to ICLR 2024"},{"id":"http://arxiv.org/abs/2401.10841v2","updated":"2024-01-23T20:05:30Z","published":"2024-01-19T17:40:50Z","title":"Using LLMs to discover emerging coded antisemitic hate-speech in\n extremist social media","summary":" Online hate speech proliferation has created a difficult problem for social\nmedia platforms. A particular challenge relates to the use of coded language by\ngroups interested in both creating a sense of belonging for its users and\nevading detection. Coded language evolves quickly and its use varies over time.\nThis paper proposes a methodology for detecting emerging coded hate-laden\nterminology. The methodology is tested in the context of online antisemitic\ndiscourse. The approach considers posts scraped from social media platforms,\noften used by extremist users. The posts are scraped using seed expressions\nrelated to previously known discourse of hatred towards Jews. The method begins\nby identifying the expressions most representative of each post and calculating\ntheir frequency in the whole corpus. 
It filters out grammatically incoherent\nexpressions as well as previously encountered ones so as to focus on emergent\nwell-formed terminology. This is followed by an assessment of semantic\nsimilarity to known antisemitic terminology using a fine-tuned large language\nmodel, and subsequent filtering out of the expressions that are too distant\nfrom known expressions of hatred. Emergent antisemitic expressions containing\nterms clearly relating to Jewish topics are then removed to return only coded\nexpressions of hatred.\n","authors":["Dhanush Kikkisetti","Raza Ul Mustafa","Wendy Melillo","Roberto Corizzo","Zois Boukouvalas","Jeff Gill","Nathalie Japkowicz"],"pdf_url":"https://arxiv.org/pdf/2401.10841v2.pdf","comment":"9 pages, 4 figures, 2 algorithms, 3 tables"},{"id":"http://arxiv.org/abs/2401.11120v2","updated":"2024-01-23T19:43:06Z","published":"2024-01-20T05:10:46Z","title":"Enhancing Large Language Models for Clinical Decision Support by\n Incorporating Clinical Practice Guidelines","summary":" Background Large Language Models (LLMs), enhanced with Clinical Practice\nGuidelines (CPGs), can significantly improve Clinical Decision Support (CDS).\nHowever, methods for incorporating CPGs into LLMs are not well studied. Methods\nWe develop three distinct methods for incorporating CPGs into LLMs: Binary\nDecision Tree (BDT), Program-Aided Graph Construction (PAGC), and\nChain-of-Thought-Few-Shot Prompting (CoT-FSP). To evaluate the effectiveness of\nthe proposed methods, we create a set of synthetic patient descriptions and\nconduct both automatic and human evaluation of the responses generated by four\nLLMs: GPT-4, GPT-3.5 Turbo, LLaMA, and PaLM 2. Zero-Shot Prompting (ZSP) was\nused as the baseline method. We focus on CDS for COVID-19 outpatient treatment\nas the case study. Results All four LLMs exhibit improved performance when\nenhanced with CPGs compared to the baseline ZSP. BDT outperformed both CoT-FSP\nand PAGC in automatic evaluation. All of the proposed methods demonstrated high\nperformance in human evaluation. Conclusion LLMs enhanced with CPGs demonstrate\nsuperior performance, as compared to plain LLMs with ZSP, in providing accurate\nrecommendations for COVID-19 outpatient treatment, which also highlights the\npotential for broader applications beyond the case study.\n","authors":["David Oniani","Xizhi Wu","Shyam Visweswaran","Sumit Kapoor","Shravan Kooragayalu","Katelyn Polanska","Yanshan Wang"],"pdf_url":"https://arxiv.org/pdf/2401.11120v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.11803v2","updated":"2024-01-23T19:37:20Z","published":"2023-12-19T02:35:13Z","title":"NLP for Maternal Healthcare: Perspectives and Guiding Principles in the\n Age of LLMs","summary":" Ethical frameworks for the use of natural language processing (NLP) are\nurgently needed to shape how large language models (LLMs) and similar tools are\nused for healthcare applications. Healthcare faces existing challenges\nincluding the balance of power in clinician-patient relationships, systemic\nhealth disparities, historical injustices, and economic constraints. Drawing\ndirectly from the voices of those most affected, and focusing on a case study\nof a specific healthcare setting, we propose a set of guiding principles for\nthe use of NLP in maternal healthcare. 
We led an interactive session centered\non an LLM-based chatbot demonstration during a full-day workshop with 39\nparticipants, and additionally surveyed 30 healthcare workers and 30 birthing\npeople about their values, needs, and perceptions of NLP tools in the context\nof maternal health. We conducted quantitative and qualitative analyses of the\nsurvey results and interactive discussions to consolidate our findings into a\nset of guiding principles. We propose nine principles for ethical use of NLP\nfor maternal healthcare, grouped into three themes: (i) recognizing contextual\nsignificance (ii) holistic measurements, and (iii) who/what is valued. For each\nprinciple, we describe its underlying rationale and provide practical advice.\nThis set of principles can provide a methodological pattern for other\nresearchers and serve as a resource to practitioners working on maternal health\nand other healthcare fields to emphasize the importance of technical nuance,\nhistorical context, and inclusive design when developing NLP technologies for\nclinical use.\n","authors":["Maria Antoniak","Aakanksha Naik","Carla S. Alvarado","Lucy Lu Wang","Irene Y. Chen"],"pdf_url":"https://arxiv.org/pdf/2312.11803v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.13060v1","updated":"2024-01-23T19:32:54Z","published":"2024-01-23T19:32:54Z","title":"TCE at Qur'an QA 2023 Shared Task: Low Resource Enhanced\n Transformer-based Ensemble Approach for Qur'anic QA","summary":" In this paper, we present our approach to tackle Qur'an QA 2023 shared tasks\nA and B. To address the challenge of low-resourced training data, we rely on\ntransfer learning together with a voting ensemble to improve prediction\nstability across multiple runs. Additionally, we employ different architectures\nand learning mechanisms for a range of Arabic pre-trained transformer-based\nmodels for both tasks. To identify unanswerable questions, we propose using a\nthresholding mechanism. Our top-performing systems greatly surpass the baseline\nperformance on the hidden split, achieving a MAP score of 25.05% for task A and\na partial Average Precision (pAP) of 57.11% for task B.\n","authors":["Mohammed Alaa Elkomy","Amany Sarhan"],"pdf_url":"https://arxiv.org/pdf/2401.13060v1.pdf","comment":null}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2401.12978v1","updated":"2024-01-23T18:59:59Z","published":"2024-01-23T18:59:59Z","title":"Zero-Shot Learning for the Primitives of 3D Affordance in General\n Objects","summary":" One of the major challenges in AI is teaching machines to precisely respond\nand utilize environmental functionalities, thereby achieving the affordance\nawareness that humans possess. Despite its importance, the field has been\nlagging in terms of learning, especially in 3D, as annotating affordance\naccompanies a laborious process due to the numerous variations of human-object\ninteraction. The low availability of affordance data limits the learning in\nterms of generalization for object categories, and also simplifies the\nrepresentation of affordance, capturing only a fraction of the affordance. To\novercome these challenges, we propose a novel, self-supervised method to\ngenerate the 3D affordance examples given only a 3D object, without any manual\nannotations. 
The method starts by capturing the 3D object into images and\ncreating 2D affordance images by inserting humans into the image via inpainting\ndiffusion models, where we present the Adaptive Mask algorithm to enable human\ninsertion without altering the original details of the object. The method\nconsequently lifts inserted humans back to 3D to create 3D human-object pairs,\nwhere the depth ambiguity is resolved within a depth optimization framework\nthat utilizes pre-generated human postures from multiple viewpoints. We also\nprovide a novel affordance representation defined on relative orientations and\nproximity between dense human and object points, that can be easily aggregated\nfrom any 3D HOI datasets. The proposed representation serves as a primitive\nthat can be manifested to conventional affordance representations via simple\ntransformations, ranging from physically exerted affordances to nonphysical\nones. We demonstrate the efficacy of our method and representation by\ngenerating the 3D affordance samples and deriving high-quality affordance\nexamples from the representation, including contact, orientation, and spatial\noccupancies.\n","authors":["Hyeonwoo Kim","Sookwan Han","Patrick Kwon","Hanbyul Joo"],"pdf_url":"https://arxiv.org/pdf/2401.12978v1.pdf","comment":"Project Page: https://sshowbiz.github.io/ZSP3A/"},{"id":"http://arxiv.org/abs/2401.12979v1","updated":"2024-01-23T18:59:59Z","published":"2024-01-23T18:59:59Z","title":"GALA: Generating Animatable Layered Assets from a Single Scan","summary":" We present GALA, a framework that takes as input a single-layer clothed 3D\nhuman mesh and decomposes it into complete multi-layered 3D assets. The outputs\ncan then be combined with other assets to create novel clothed human avatars\nwith any pose. Existing reconstruction approaches often treat clothed humans as\na single-layer of geometry and overlook the inherent compositionality of humans\nwith hairstyles, clothing, and accessories, thereby limiting the utility of the\nmeshes for downstream applications. Decomposing a single-layer mesh into\nseparate layers is a challenging task because it requires the synthesis of\nplausible geometry and texture for the severely occluded regions. Moreover,\neven with successful decomposition, meshes are not normalized in terms of poses\nand body shapes, failing coherent composition with novel identities and poses.\nTo address these challenges, we propose to leverage the general knowledge of a\npretrained 2D diffusion model as geometry and appearance prior for humans and\nother assets. We first separate the input mesh using the 3D surface\nsegmentation extracted from multi-view 2D segmentations. Then we synthesize the\nmissing geometry of different layers in both posed and canonical spaces using a\nnovel pose-guided Score Distillation Sampling (SDS) loss. Once we complete\ninpainting high-fidelity 3D geometry, we also apply the same SDS loss to its\ntexture to obtain the complete appearance including the initially occluded\nregions. Through a series of decomposition steps, we obtain multiple layers of\n3D assets in a shared canonical space normalized in terms of poses and human\nshapes, hence supporting effortless composition to novel identities and\nreanimation with novel poses. 
Our experiments demonstrate the effectiveness of\nour approach for decomposition, canonicalization, and composition tasks\ncompared to existing solutions.\n","authors":["Taeksoo Kim","Byungjun Kim","Shunsuke Saito","Hanbyul Joo"],"pdf_url":"https://arxiv.org/pdf/2401.12979v1.pdf","comment":"The project page is available at https://snuvclab.github.io/gala/"},{"id":"http://arxiv.org/abs/2401.12977v1","updated":"2024-01-23T18:59:56Z","published":"2024-01-23T18:59:56Z","title":"IRIS: Inverse Rendering of Indoor Scenes from Low Dynamic Range Images","summary":" While numerous 3D reconstruction and novel-view synthesis methods allow for\nphotorealistic rendering of a scene from multi-view images easily captured with\nconsumer cameras, they bake illumination in their representations and fall\nshort of supporting advanced applications like material editing, relighting,\nand virtual object insertion. The reconstruction of physically based material\nproperties and lighting via inverse rendering promises to enable such\napplications.\n However, most inverse rendering techniques require high dynamic range (HDR)\nimages as input, a setting that is inaccessible to most users. We present a\nmethod that recovers the physically based material properties and\nspatially-varying HDR lighting of a scene from multi-view, low-dynamic-range\n(LDR) images. We model the LDR image formation process in our inverse rendering\npipeline and propose a novel optimization strategy for material, lighting, and\na camera response model. We evaluate our approach with synthetic and real\nscenes compared to the state-of-the-art inverse rendering methods that take\neither LDR or HDR input. Our method outperforms existing methods taking LDR\nimages as input, and allows for highly realistic relighting and object\ninsertion.\n","authors":["Zhi-Hao Lin","Jia-Bin Huang","Zhengqin Li","Zhao Dong","Christian Richardt","Tuotuo Li","Michael Zollhöfer","Johannes Kopf","Shenlong Wang","Changil Kim"],"pdf_url":"https://arxiv.org/pdf/2401.12977v1.pdf","comment":"Project Website: https://irisldr.github.io/"},{"id":"http://arxiv.org/abs/2401.04079v2","updated":"2024-01-23T18:59:52Z","published":"2024-01-08T18:31:38Z","title":"RudolfV: A Foundation Model by Pathologists for Pathologists","summary":" Histopathology plays a central role in clinical medicine and biomedical\nresearch. While artificial intelligence shows promising results on many\npathological tasks, generalization and dealing with rare diseases, where\ntraining data is scarce, remains a challenge. Distilling knowledge from\nunlabeled data into a foundation model before learning from, potentially\nlimited, labeled data provides a viable path to address these challenges. In\nthis work, we extend the state of the art of foundation models for digital\npathology whole slide images by semi-automated data curation and incorporating\npathologist domain knowledge. Specifically, we combine computational and\npathologist domain knowledge (1) to curate a diverse dataset of 103k slides\ncorresponding to 750 million image patches covering data from different\nfixation, staining, and scanning protocols as well as data from different\nindications and labs across the EU and US, (2) for grouping semantically\nsimilar slides and tissue patches, and (3) to augment the input images during\ntraining. 
We evaluate the resulting model on a set of public and internal\nbenchmarks and show that although our foundation model is trained with an order\nof magnitude less slides, it performs on par or better than competing models.\nWe expect that scaling our approach to more data and larger models will further\nincrease its performance and capacity to deal with increasingly complex real\nworld tasks in diagnostics and biomedical research.\n","authors":["Jonas Dippel","Barbara Feulner","Tobias Winterhoff","Simon Schallenberg","Gabriel Dernbach","Andreas Kunft","Stephan Tietz","Philipp Jurmeister","David Horst","Lukas Ruff","Klaus-Robert Müller","Frederick Klauschen","Maximilian Alber"],"pdf_url":"https://arxiv.org/pdf/2401.04079v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12975v1","updated":"2024-01-23T18:59:43Z","published":"2024-01-23T18:59:43Z","title":"HAZARD Challenge: Embodied Decision Making in Dynamically Changing\n Environments","summary":" Recent advances in high-fidelity virtual environments serve as one of the\nmajor driving forces for building intelligent embodied agents to perceive,\nreason and interact with the physical world. Typically, these environments\nremain unchanged unless agents interact with them. However, in real-world\nscenarios, agents might also face dynamically changing environments\ncharacterized by unexpected events and need to rapidly take action accordingly.\nTo remedy this gap, we propose a new simulated embodied benchmark, called\nHAZARD, specifically designed to assess the decision-making abilities of\nembodied agents in dynamic situations. HAZARD consists of three unexpected\ndisaster scenarios, including fire, flood, and wind, and specifically supports\nthe utilization of large language models (LLMs) to assist common sense\nreasoning and decision-making. This benchmark enables us to evaluate autonomous\nagents' decision-making capabilities across various pipelines, including\nreinforcement learning (RL), rule-based, and search-based methods in\ndynamically changing environments. As a first step toward addressing this\nchallenge using large language models, we further develop an LLM-based agent\nand perform an in-depth analysis of its promise and challenge of solving these\nchallenging tasks. HAZARD is available at https://vis-www.cs.umass.edu/hazard/.\n","authors":["Qinhong Zhou","Sunli Chen","Yisong Wang","Haozhe Xu","Weihua Du","Hongxin Zhang","Yilun Du","Joshua B. Tenenbaum","Chuang Gan"],"pdf_url":"https://arxiv.org/pdf/2401.12975v1.pdf","comment":"ICLR 2024. The first two authors contributed equally to this work"},{"id":"http://arxiv.org/abs/2312.12433v2","updated":"2024-01-23T18:59:39Z","published":"2023-12-19T18:58:40Z","title":"Tracking Any Object Amodally","summary":" Amodal perception, the ability to comprehend complete object structures from\npartial visibility, is a fundamental skill, even for infants. Its significance\nextends to applications like autonomous driving, where a clear understanding of\nheavily occluded objects is essential. However, modern detection and tracking\nalgorithms often overlook this critical capability, perhaps due to the\nprevalence of modal annotations in most datasets. To address the scarcity of\namodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse\ncategories in thousands of video sequences. Our dataset includes amodal and\nmodal bounding boxes for visible and occluded objects, including objects that\nare partially out-of-frame. 
To enhance amodal tracking with object permanence,\nwe leverage a lightweight plug-in module, the amodal expander, to transform\nstandard, modal trackers into amodal ones through fine-tuning on a few hundred\nvideo sequences with data augmentation. We achieve a 3.3\\% and 1.6\\%\nimprovement on the detection and tracking of occluded objects on TAO-Amodal.\nWhen evaluated on people, our method produces dramatic improvements of 2x\ncompared to state-of-the-art modal baselines.\n","authors":["Cheng-Yen Hsieh","Tarasha Khurana","Achal Dave","Deva Ramanan"],"pdf_url":"https://arxiv.org/pdf/2312.12433v2.pdf","comment":"Project Page: https://tao-amodal.github.io"},{"id":"http://arxiv.org/abs/2401.12974v1","updated":"2024-01-23T18:59:25Z","published":"2024-01-23T18:59:25Z","title":"SegmentAnyBone: A Universal Model that Segments Any Bone at Any Location\n on MRI","summary":" Magnetic Resonance Imaging (MRI) is pivotal in radiology, offering\nnon-invasive and high-quality insights into the human body. Precise\nsegmentation of MRIs into different organs and tissues would be highly\nbeneficial since it would allow for a higher level of understanding of the\nimage content and enable important measurements, which are essential for\naccurate diagnosis and effective treatment planning. Specifically, segmenting\nbones in MRI would allow for more quantitative assessments of musculoskeletal\nconditions, while such assessments are largely absent in current radiological\npractice. The difficulty of bone MRI segmentation is illustrated by the fact\nthat limited algorithms are publicly available for use, and those contained in\nthe literature typically address a specific anatomic area. In our study, we\npropose a versatile, publicly available deep-learning model for bone\nsegmentation in MRI across multiple standard MRI locations. The proposed model\ncan operate in two modes: fully automated segmentation and prompt-based\nsegmentation. Our contributions include (1) collecting and annotating a new MRI\ndataset across various MRI protocols, encompassing over 300 annotated volumes\nand 8485 annotated slices across diverse anatomic regions; (2) investigating\nseveral standard network architectures and strategies for automated\nsegmentation; (3) introducing SegmentAnyBone, an innovative foundational\nmodel-based approach that extends Segment Anything Model (SAM); (4) comparative\nanalysis of our algorithm and previous approaches; and (5) generalization\nanalysis of our algorithm across different anatomical locations and MRI\nsequences, as well as an external dataset. We publicly release our model at\nhttps://github.com/mazurowski-lab/SegmentAnyBone.\n","authors":["Hanxue Gu","Roy Colglazier","Haoyu Dong","Jikai Zhang","Yaqian Chen","Zafer Yildiz","Yuwen Chen","Lin Li","Jichen Yang","Jay Willhite","Alex M. Meyer","Brian Guo","Yashvi Atul Shah","Emily Luo","Shipra Rajput","Sally Kuehn","Clark Bulleit","Kevin A. Wu","Jisoo Lee","Brandon Ramirez","Darui Lu","Jay M. Levin","Maciej A. Mazurowski"],"pdf_url":"https://arxiv.org/pdf/2401.12974v1.pdf","comment":"15 pages, 15 figures"},{"id":"http://arxiv.org/abs/2401.12972v1","updated":"2024-01-23T18:58:35Z","published":"2024-01-23T18:58:35Z","title":"On the Efficacy of Text-Based Input Modalities for Action Anticipation","summary":" Although the task of anticipating future actions is highly uncertain,\ninformation from additional modalities help to narrow down plausible action\nchoices. Each modality provides different environmental context for the model\nto learn from. 
While previous multi-modal methods leverage information from\nmodalities such as video and audio, we primarily explore how text inputs for\nactions and objects can also enable more accurate action anticipation.\nTherefore, we propose a Multi-modal Anticipative Transformer (MAT), an\nattention-based video transformer architecture that jointly learns from\nmulti-modal features and text captions. We train our model in two stages, where\nthe model first learns to predict actions in the video clip by aligning with\ncaptions, and during the second stage, we fine-tune the model to predict future\nactions. Compared to existing methods, MAT has the advantage of learning\nadditional environmental context from two kinds of text inputs: action\ndescriptions during the pre-training stage, and the text inputs for detected\nobjects and actions during modality feature fusion. Through extensive\nexperiments, we evaluate the effectiveness of the pre-training stage, and show\nthat our model outperforms previous methods on all datasets. In addition, we\nexamine the impact of object and action information obtained via text and\nperform extensive ablations. We evaluate the performance on three datasets:\nEpicKitchens-100, EpicKitchens-55 and EGTEA GAZE+; and show that text\ndescriptions do indeed aid in more effective action anticipation.\n","authors":["Apoorva Beedu","Karan Samel","Irfan Essa"],"pdf_url":"https://arxiv.org/pdf/2401.12972v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12963v1","updated":"2024-01-23T18:45:54Z","published":"2024-01-23T18:45:54Z","title":"AutoRT: Embodied Foundation Models for Large Scale Orchestration of\n Robotic Agents","summary":" Foundation models that incorporate language, vision, and more recently\nactions have revolutionized the ability to harness internet scale data to\nreason about useful tasks. However, one of the key challenges of training\nembodied foundation models is the lack of data grounded in the physical world.\nIn this paper, we propose AutoRT, a system that leverages existing foundation\nmodels to scale up the deployment of operational robots in completely unseen\nscenarios with minimal human supervision. AutoRT leverages vision-language\nmodels (VLMs) for scene understanding and grounding, and further uses large\nlanguage models (LLMs) for proposing diverse and novel instructions to be\nperformed by a fleet of robots. Guiding data collection by tapping into the\nknowledge of foundation models enables AutoRT to effectively reason about\nautonomy tradeoffs and safety while significantly scaling up data collection\nfor robot learning. We demonstrate AutoRT proposing instructions to over 20\nrobots across multiple buildings and collecting 77k real robot episodes via\nboth teleoperation and autonomous robot policies. 
We experimentally show that\nsuch \"in-the-wild\" data collected by AutoRT is significantly more diverse, and\nthat AutoRT's use of LLMs allows for instruction following data collection\nrobots that can align to human preferences.\n","authors":["Michael Ahn","Debidatta Dwibedi","Chelsea Finn","Montse Gonzalez Arenas","Keerthana Gopalakrishnan","Karol Hausman","Brian Ichter","Alex Irpan","Nikhil Joshi","Ryan Julian","Sean Kirmani","Isabel Leal","Edward Lee","Sergey Levine","Yao Lu","Isabel Leal","Sharath Maddineni","Kanishka Rao","Dorsa Sadigh","Pannag Sanketi","Pierre Sermanet","Quan Vuong","Stefan Welker","Fei Xia","Ted Xiao","Peng Xu","Steve Xu","Zhuo Xu"],"pdf_url":"https://arxiv.org/pdf/2401.12963v1.pdf","comment":"26 pages, 9 figures"},{"id":"http://arxiv.org/abs/2303.07700v3","updated":"2024-01-23T18:37:41Z","published":"2023-03-14T08:28:36Z","title":"PATS: Patch Area Transportation with Subdivision for Local Feature\n Matching","summary":" Local feature matching aims at establishing sparse correspondences between a\npair of images. Recently, detector-free methods present generally better\nperformance but are not satisfactory in image pairs with large scale\ndifferences. In this paper, we propose Patch Area Transportation with\nSubdivision (PATS) to tackle this issue. Instead of building an expensive image\npyramid, we start by splitting the original image pair into equal-sized patches\nand gradually resizing and subdividing them into smaller patches with the same\nscale. However, estimating scale differences between these patches is\nnon-trivial since the scale differences are determined by both relative camera\nposes and scene structures, and thus spatially varying over image pairs.\nMoreover, it is hard to obtain the ground truth for real scenes. To this end,\nwe propose patch area transportation, which enables learning scale differences\nin a self-supervised manner. In contrast to bipartite graph matching, which\nonly handles one-to-one matching, our patch area transportation can deal with\nmany-to-many relationships. PATS improves both matching accuracy and coverage,\nand shows superior performance in downstream tasks, such as relative pose\nestimation, visual localization, and optical flow estimation. The source code\nis available at \\url{https://zju3dv.github.io/pats/}.\n","authors":["Junjie Ni","Yijin Li","Zhaoyang Huang","Hongsheng Li","Hujun Bao","Zhaopeng Cui","Guofeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2303.07700v3.pdf","comment":"Accepted to CVPR 2023. Project page: https://zju3dv.github.io/pats"},{"id":"http://arxiv.org/abs/2401.12946v1","updated":"2024-01-23T18:07:07Z","published":"2024-01-23T18:07:07Z","title":"Coverage Axis++: Efficient Inner Point Selection for 3D Shape\n Skeletonization","summary":" We introduce Coverage Axis++, a novel and efficient approach to 3D shape\nskeletonization. The current state-of-the-art approaches for this task often\nrely on the watertightness of the input or suffer from substantial\ncomputational costs, thereby limiting their practicality. To address this\nchallenge, Coverage Axis++ proposes a heuristic algorithm to select skeletal\npoints, offering a high-accuracy approximation of the Medial Axis Transform\n(MAT) while significantly mitigating computational intensity for various shape\nrepresentations. We introduce a simple yet effective strategy that considers\nboth shape coverage and uniformity to derive skeletal points. 
The selection\nprocedure enforces consistency with the shape structure while favoring the\ndominant medial balls, which thus introduces a compact underlying shape\nrepresentation in terms of MAT. As a result, Coverage Axis++ allows for\nskeletonization for various shape representations (e.g., water-tight meshes,\ntriangle soups, point clouds), specification of the number of skeletal points,\nfew hyperparameters, and highly efficient computation with improved\nreconstruction accuracy. Extensive experiments across a wide range of 3D shapes\nvalidate the efficiency and effectiveness of Coverage Axis++. The code will be\npublicly available once the paper is published.\n","authors":["Zimeng Wang","Zhiyang Dou","Rui Xu","Cheng Lin","Yuan Liu","Xiaoxiao Long","Shiqing Xin","Lingjie Liu","Taku Komura","Xiaoming Yuan","Wenping Wang"],"pdf_url":"https://arxiv.org/pdf/2401.12946v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12945v1","updated":"2024-01-23T18:05:25Z","published":"2024-01-23T18:05:25Z","title":"Lumiere: A Space-Time Diffusion Model for Video Generation","summary":" We introduce Lumiere -- a text-to-video diffusion model designed for\nsynthesizing videos that portray realistic, diverse and coherent motion -- a\npivotal challenge in video synthesis. To this end, we introduce a Space-Time\nU-Net architecture that generates the entire temporal duration of the video at\nonce, through a single pass in the model. This is in contrast to existing video\nmodels which synthesize distant keyframes followed by temporal super-resolution\n-- an approach that inherently makes global temporal consistency difficult to\nachieve. By deploying both spatial and (importantly) temporal down- and\nup-sampling and leveraging a pre-trained text-to-image diffusion model, our\nmodel learns to directly generate a full-frame-rate, low-resolution video by\nprocessing it in multiple space-time scales. We demonstrate state-of-the-art\ntext-to-video generation results, and show that our design easily facilitates a\nwide range of content creation tasks and video editing applications, including\nimage-to-video, video inpainting, and stylized generation.\n","authors":["Omer Bar-Tal","Hila Chefer","Omer Tov","Charles Herrmann","Roni Paiss","Shiran Zada","Ariel Ephrat","Junhwa Hur","Yuanzhen Li","Tomer Michaeli","Oliver Wang","Deqing Sun","Tali Dekel","Inbar Mosseri"],"pdf_url":"https://arxiv.org/pdf/2401.12945v1.pdf","comment":"Webpage: https://lumiere-video.github.io/ | Video:\n https://www.youtube.com/watch?v=wxLr02Dz2Sc"},{"id":"http://arxiv.org/abs/2401.11114v2","updated":"2024-01-23T18:00:13Z","published":"2024-01-20T04:55:29Z","title":"DengueNet: Dengue Prediction using Spatiotemporal Satellite Imagery for\n Resource-Limited Countries","summary":" Dengue fever presents a substantial challenge in developing countries where\nsanitation infrastructure is inadequate. The absence of comprehensive\nhealthcare systems exacerbates the severity of dengue infections, potentially\nleading to life-threatening circumstances. Rapid response to dengue outbreaks\nis also challenging due to limited information exchange and integration. While\ntimely dengue outbreak forecasts have the potential to prevent such outbreaks,\nthe majority of dengue prediction studies have predominantly relied on data\nthat impose significant burdens on individual countries for collection. 
In this\nstudy, our aim is to improve health equity in resource-constrained countries by\nexploring the effectiveness of high-resolution satellite imagery as a\nnontraditional and readily accessible data source. By leveraging the wealth of\npublicly available and easily obtainable satellite imagery, we present a\nscalable satellite extraction framework based on Sentinel Hub, a cloud-based\ncomputing platform. Furthermore, we introduce DengueNet, an innovative\narchitecture that combines Vision Transformer, Radiomics, and Long Short-term\nMemory to extract and integrate spatiotemporal features from satellite images.\nThis enables dengue predictions on an epi-week basis. To evaluate the\neffectiveness of our proposed method, we conducted experiments on five\nmunicipalities in Colombia. We utilized a dataset comprising 780\nhigh-resolution Sentinel-2 satellite images for training and evaluation. The\nperformance of DengueNet was assessed using the mean absolute error (MAE)\nmetric. Across the five municipalities, DengueNet achieved an average MAE of\n43.92. Our findings strongly support the efficacy of satellite imagery as a\nvaluable resource for dengue prediction, particularly in informing public\nhealth policies within countries where manually collected data is scarce and\ndengue virus prevalence is severe.\n","authors":["Kuan-Ting Kuo","Dana Moukheiber","Sebastian Cajas Ordonez","David Restrepo","Atika Rahman Paddo","Tsung-Yu Chen","Lama Moukheiber","Mira Moukheiber","Sulaiman Moukheiber","Saptarshi Purkayastha","Po-Chih Kuo","Leo Anthony Celi"],"pdf_url":"https://arxiv.org/pdf/2401.11114v2.pdf","comment":"Published at the IJCAI 2023 Workshop on Bridge-AI: from Climate\n Change to Health Equity (BridgeAICCHE)., Macao, S.A.R"},{"id":"http://arxiv.org/abs/2401.12938v1","updated":"2024-01-23T17:50:58Z","published":"2024-01-23T17:50:58Z","title":"Neural deformation fields for template-based reconstruction of cortical\n surfaces from MRI","summary":" The reconstruction of cortical surfaces is a prerequisite for quantitative\nanalyses of the cerebral cortex in magnetic resonance imaging (MRI). Existing\nsegmentation-based methods separate the surface registration from the surface\nextraction, which is computationally inefficient and prone to distortions. We\nintroduce Vox2Cortex-Flow (V2C-Flow), a deep mesh-deformation technique that\nlearns a deformation field from a brain template to the cortical surfaces of an\nMRI scan. To this end, we present a geometric neural network that models the\ndeformation-describing ordinary differential equation in a continuous manner.\nThe network architecture comprises convolutional and graph-convolutional\nlayers, which allows it to work with images and meshes at the same time.\nV2C-Flow is not only very fast, requiring less than two seconds to infer all\nfour cortical surfaces, but also establishes vertex-wise correspondences to the\ntemplate during reconstruction. In addition, V2C-Flow is the first approach for\ncortex reconstruction that models white matter and pial surfaces jointly,\ntherefore avoiding intersections between them. Our comprehensive experiments on\ninternal and external test data demonstrate that V2C-Flow results in cortical\nsurfaces that are state-of-the-art in terms of accuracy. 
Moreover, we show that\nthe established correspondences are more consistent than in FreeSurfer and that\nthey can directly be utilized for cortex parcellation and group analyses of\ncortical thickness.\n","authors":["Fabian Bongratz","Anne-Marie Rickmann","Christian Wachinger"],"pdf_url":"https://arxiv.org/pdf/2401.12938v1.pdf","comment":"To appear in Medical Image Analysis"},{"id":"http://arxiv.org/abs/2401.12932v1","updated":"2024-01-23T17:37:34Z","published":"2024-01-23T17:37:34Z","title":"Segmentation of tibiofemoral joint tissues from knee MRI using MtRA-Unet\n and incorporating shape information: Data from the Osteoarthritis Initiative","summary":" Knee Osteoarthritis (KOA) is the third most prevalent Musculoskeletal\nDisorder (MSD) after neck and back pain. To monitor such a severe MSD, a\nsegmentation map of the femur, tibia and tibiofemoral cartilage is usually\naccessed using the automated segmentation algorithm from the Magnetic Resonance\nImaging (MRI) of the knee. But, in recent works, such segmentation is\nconceivable only from the multistage framework thus creating data handling\nissues and needing continuous manual inference rendering it unable to make a\nquick and precise clinical diagnosis. In order to solve these issues, in this\npaper the Multi-Resolution Attentive-Unet (MtRA-Unet) is proposed to segment\nthe femur, tibia and tibiofemoral cartilage automatically. The proposed work\nhas included a novel Multi-Resolution Feature Fusion (MRFF) and Shape\nReconstruction (SR) loss that focuses on multi-contextual information and\nstructural anatomical details of the femur, tibia and tibiofemoral cartilage.\nUnlike previous approaches, the proposed work is a single-stage and end-to-end\nframework producing a Dice Similarity Coefficient (DSC) of 98.5% for the femur,\n98.4% for the tibia, 89.1% for Femoral Cartilage (FC) and 86.1% for Tibial\nCartilage (TC) for critical MRI slices that can be helpful to clinicians for\nKOA grading. The time to segment MRI volume (160 slices) per subject is 22 sec.\nwhich is one of the fastest among state-of-the-art. Moreover, comprehensive\nexperimentation on the segmentation of FC and TC which is of utmost importance\nfor morphology-based studies to check KOA progression reveals that the proposed\nmethod has produced an excellent result with binary segmentation\n","authors":["Akshay Daydar","Alik Pramanick","Arijit Sur","Subramani Kanagaraj"],"pdf_url":"https://arxiv.org/pdf/2401.12932v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12915v1","updated":"2024-01-23T17:07:18Z","published":"2024-01-23T17:07:18Z","title":"Red Teaming Visual Language Models","summary":" VLMs (Vision-Language Models) extend the capabilities of LLMs (Large Language\nModels) to accept multimodal inputs. Since it has been verified that LLMs can\nbe induced to generate harmful or inaccurate content through specific test\ncases (termed as Red Teaming), how VLMs perform in similar scenarios,\nespecially with their combination of textual and visual inputs, remains a\nquestion. To explore this problem, we present a novel red teaming dataset\nRTVLM, which encompasses 10 subtasks (e.g., image misleading, multi-modal\njail-breaking, face fairness, etc) under 4 primary aspects (faithfulness,\nprivacy, safety, fairness). Our RTVLM is the first red-teaming dataset to\nbenchmark current VLMs in terms of these 4 different aspects. 
Detailed analysis\nshows that 10 prominent open-sourced VLMs struggle with the red teaming in\ndifferent degrees and have up to 31% performance gap with GPT-4V. Additionally,\nwe simply apply red teaming alignment to LLaVA-v1.5 with Supervised Fine-tuning\n(SFT) using RTVLM, and this bolsters the models' performance with 10% in RTVLM\ntest set, 13% in MM-Hal, and without noticeable decline in MM-Bench,\noverpassing other LLaVA-based models with regular alignment data. This reveals\nthat current open-sourced VLMs still lack red teaming alignment. Our code and\ndatasets will be open-source.\n","authors":["Mukai Li","Lei Li","Yuwei Yin","Masood Ahmed","Zhenguang Liu","Qi Liu"],"pdf_url":"https://arxiv.org/pdf/2401.12915v1.pdf","comment":"Working in progress"},{"id":"http://arxiv.org/abs/2312.02218v2","updated":"2024-01-23T16:53:48Z","published":"2023-12-03T15:19:08Z","title":"WavePlanes: A compact Wavelet representation for Dynamic Neural Radiance\n Fields","summary":" Dynamic Neural Radiance Fields (Dynamic NeRF) enhance NeRF technology to\nmodel moving scenes. However, they are resource intensive and challenging to\ncompress. To address this issue, this paper presents WavePlanes, a fast and\nmore compact explicit model. We propose a multi-scale space and space-time\nfeature plane representation using N-level 2-D wavelet coefficients. The\ninverse discrete wavelet transform reconstructs N feature signals at varying\ndetail, which are linearly decoded to approximate the color and density of\nvolumes in a 4-D grid. Exploiting the sparsity of wavelet coefficients, we\ncompress a Hash Map containing only non-zero coefficients and their locations\non each plane. This results in a compressed model size of ~12 MB. Compared with\nstate-of-the-art plane-based models, WavePlanes is up to 15x smaller, less\ncomputationally demanding and achieves comparable results in as little as one\nhour of training - without requiring custom CUDA code or high performance\ncomputing resources. Additionally, we propose new feature fusion schemes that\nwork as well as previously proposed schemes while providing greater\ninterpretability. Our code is available at:\nhttps://github.com/azzarelli/waveplanes/\n","authors":["Adrian Azzarelli","Nantheera Anantrasirichai","David R Bull"],"pdf_url":"https://arxiv.org/pdf/2312.02218v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12902v1","updated":"2024-01-23T16:48:18Z","published":"2024-01-23T16:48:18Z","title":"Facing the Elephant in the Room: Visual Prompt Tuning or Full\n Finetuning?","summary":" As the scale of vision models continues to grow, the emergence of Visual\nPrompt Tuning (VPT) as a parameter-efficient transfer learning technique has\ngained attention due to its superior performance compared to traditional\nfull-finetuning. However, the conditions favoring VPT (the ``when\") and the\nunderlying rationale (the ``why\") remain unclear. In this paper, we conduct a\ncomprehensive analysis across 19 distinct datasets and tasks. To understand the\n``when\" aspect, we identify the scenarios where VPT proves favorable by two\ndimensions: task objectives and data distributions. We find that VPT is\npreferrable when there is 1) a substantial disparity between the original and\nthe downstream task objectives (e.g., transitioning from classification to\ncounting), or 2) a similarity in data distributions between the two tasks\n(e.g., both involve natural images). 
In exploring the ``why\" dimension, our\nresults indicate VPT's success cannot be attributed solely to overfitting and\noptimization considerations. The unique way VPT preserves original features and\nadds parameters appears to be a pivotal factor. Our study provides insights\ninto VPT's mechanisms, and offers guidance for its optimal utilization.\n","authors":["Cheng Han","Qifan Wang","Yiming Cui","Wenguan Wang","Lifu Huang","Siyuan Qi","Dongfang Liu"],"pdf_url":"https://arxiv.org/pdf/2401.12902v1.pdf","comment":"29 pages, 19 figures"},{"id":"http://arxiv.org/abs/2401.12900v1","updated":"2024-01-23T16:40:47Z","published":"2024-01-23T16:40:47Z","title":"PSAvatar: A Point-based Morphable Shape Model for Real-Time Head Avatar\n Creation with 3D Gaussian Splatting","summary":" Despite much progress, creating real-time high-fidelity head avatar is still\ndifficult and existing methods have to trade-off between speed and quality.\n3DMM based methods often fail to model non-facial structures such as eyeglasses\nand hairstyles, while neural implicit models suffer from deformation\ninflexibility and rendering inefficiency.\n Although 3D Gaussian has been demonstrated to possess promising capability\nfor geometry representation and radiance field reconstruction, applying 3D\nGaussian in head avatar creation remains a major challenge since it is\ndifficult for 3D Gaussian to model the head shape variations caused by changing\nposes and expressions. In this paper, we introduce PSAvatar, a novel framework\nfor animatable head avatar creation that utilizes discrete geometric primitive\nto create a parametric morphable shape model and employs 3D Gaussian for fine\ndetail representation and high fidelity rendering. The parametric morphable\nshape model is a Point-based Morphable Shape Model (PMSM) which uses points\ninstead of meshes for 3D representation to achieve enhanced representation\nflexibility. The PMSM first converts the FLAME mesh to points by sampling on\nthe surfaces as well as off the meshes to enable the reconstruction of not only\nsurface-like structures but also complex geometries such as eyeglasses and\nhairstyles. By aligning these points with the head shape in an\nanalysis-by-synthesis manner, the PMSM makes it possible to utilize 3D Gaussian\nfor fine detail representation and appearance modeling, thus enabling the\ncreation of high-fidelity avatars. We show that PSAvatar can reconstruct\nhigh-fidelity head avatars of a variety of subjects and the avatars can be\nanimated in real-time ($\\ge$ 25 fps at a resolution of 512 x 512 )\n","authors":["Zhongyuan Zhao","Zhenyu Bao","Qing Li","Guoping Qiu","Kanglin Liu"],"pdf_url":"https://arxiv.org/pdf/2401.12900v1.pdf","comment":"13 pages, 10 figures"},{"id":"http://arxiv.org/abs/2302.05154v2","updated":"2024-01-23T16:31:56Z","published":"2023-02-10T10:25:12Z","title":"Industrial and Medical Anomaly Detection Through Cycle-Consistent\n Adversarial Networks","summary":" In this study, a new Anomaly Detection (AD) approach for industrial and\nmedical images is proposed. This method leverages the theoretical strengths of\nunsupervised learning and the data availability of both normal and abnormal\nclasses. Indeed, the AD is often formulated as an unsupervised task, implying\nonly normal images during training. These normal images are devoted to be\nreconstructed, through an autoencoder architecture for instance. However, the\ninformation contained in abnormal data, when available, is also valuable for\nthis reconstruction. 
The model would be able to identify its weaknesses by\nbetter learning how to transform an abnormal (respectively normal) image into a\nnormal (respectively abnormal) one, helping the entire model to learn better\nthan a single normal to normal reconstruction. To address this challenge, the\nproposed method uses Cycle-Generative Adversarial Networks (Cycle-GAN) for\n(ab)normal-to-normal translation. After an input image has been reconstructed\nby the normal generator, an anomaly score quantifies the differences between\nthe input and its reconstruction. Based on a threshold set to satisfy a\nbusiness quality constraint, the input image is then flagged as normal or not.\nThe proposed method is evaluated on industrial and medical datasets. The\nresults demonstrate accurate performance with a zero false negative constraint\ncompared to state-of-the-art methods. The code is available at\nhttps://github.com/ValDelch/CycleGANS-AnomalyDetection.\n","authors":["Arnaud Bougaham","Valentin Delchevalerie","Mohammed El Adoui","Benoît Frénay"],"pdf_url":"https://arxiv.org/pdf/2302.05154v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12888v1","updated":"2024-01-23T16:28:30Z","published":"2024-01-23T16:28:30Z","title":"Data-Centric Evolution in Autonomous Driving: A Comprehensive Survey of\n Big Data System, Data Mining, and Closed-Loop Technologies","summary":" The aspiration of the next generation's autonomous driving (AD) technology\nrelies on the dedicated integration and interaction among intelligent\nperception, prediction, planning, and low-level control. There has been a huge\nbottleneck regarding the upper bound of autonomous driving algorithm\nperformance, a consensus from academia and industry believes that the key to\nsurmount the bottleneck lies in data-centric autonomous driving technology.\nRecent advancement in AD simulation, closed-loop model training, and AD big\ndata engine have gained some valuable experience. However, there is a lack of\nsystematic knowledge and deep understanding regarding how to build efficient\ndata-centric AD technology for AD algorithm self-evolution and better AD big\ndata accumulation. To fill in the identified research gaps, this article will\nclosely focus on reviewing the state-of-the-art data-driven autonomous driving\ntechnologies, with an emphasis on the comprehensive taxonomy of autonomous\ndriving datasets characterized by milestone generations, key features, data\nacquisition settings, etc. Furthermore, we provide a systematic review of the\nexisting benchmark closed-loop AD big data pipelines from the industrial\nfrontier, including the procedure of closed-loop frameworks, key technologies,\nand empirical studies. Finally, the future directions, potential applications,\nlimitations and concerns are discussed to arouse efforts from both academia and\nindustry for promoting the further development of autonomous driving.\n","authors":["Lincan Li","Wei Shao","Wei Dong","Yijun Tian","Kaixiang Yang","Wenjie Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.12888v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12870v1","updated":"2024-01-23T16:04:19Z","published":"2024-01-23T16:04:19Z","title":"Unlocking the Potential: Multi-task Deep Learning for Spaceborne\n Quantitative Monitoring of Fugitive Methane Plumes","summary":" With the intensification of global warming, the monitoring of methane\nemission and detection of gas plumes from landfills have increasingly received\nattention. 
We decompose methane emission monitoring into three sub-tasks:\nmethane concentration inversion, plume segmentation, and emission rate\nestimation. Conventional algorithms have limitations: methane concentration\ninversion usually uses the matched filter, which is sensitive to global\nspectrum distribution and contains a large amount of noises. There is limited\nresearch on plume segmentation, with many studies resorting to manual\nsegmentation that is likely to be subjective. The estimation of methane\nemission rate often utilizes IME algorithm, which relies on obtaining\nmeteorological measurement data. Using the WENT landfill site in Hong Kong and\nPRISMA hyperspectral satellite imagery, we propose a new deep learning-based\nframework for quantitative monitoring of methane emissions from remote sensing\nimages based on physical simulation. We generate simulated methane plumes using\nlarge eddy simulation (LES) and different concentration maps of fugitive\nemission using the radiative transfer equation (RTE), while combining\naugmentation techniques to create a simulated PRISMA dataset. We train a U-Net\nnetwork for methane concentration inversion, a Mask R-CNN network for methane\nplume segmentation, and a ResNet-50 network for methane emission rate\nestimation. All three deep networks achieve higher validation accuracy compared\nto conventional algorithms. We further respectively combine the first two\nsub-tasks and the last two sub-tasks to design the multi-task learning models -\nMTL-01 and MTL-02, both of which achieve higher accuracy than single-task\nmodels. Our research serves as a demonstration of applying multi-task deep\nlearning to quantitative methane monitoring and can be extended to a broad\nrange of methane monitoring tasks.\n","authors":["Guoxin Si","Shiliang Fu","Wei Yao"],"pdf_url":"https://arxiv.org/pdf/2401.12870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12862v1","updated":"2024-01-23T15:52:57Z","published":"2024-01-23T15:52:57Z","title":"FedRSU: Federated Learning for Scene Flow Estimation on Roadside Units","summary":" Roadside unit (RSU) can significantly improve the safety and robustness of\nautonomous vehicles through Vehicle-to-Everything (V2X) communication.\nCurrently, the usage of a single RSU mainly focuses on real-time inference and\nV2X collaboration, while neglecting the potential value of the high-quality\ndata collected by RSU sensors. Integrating the vast amounts of data from\nnumerous RSUs can provide a rich source of data for model training. However,\nthe absence of ground truth annotations and the difficulty of transmitting\nenormous volumes of data are two inevitable barriers to fully exploiting this\nhidden value. In this paper, we introduce FedRSU, an innovative federated\nlearning framework for self-supervised scene flow estimation. In FedRSU, we\npresent a recurrent self-supervision training paradigm, where for each RSU, the\nscene flow prediction of points at every timestamp can be supervised by its\nsubsequent future multi-modality observation. Another key component of FedRSU\nis federated learning, where multiple devices collaboratively train an ML model\nwhile keeping the training data local and private. With the power of the\nrecurrent self-supervised learning paradigm, FL is able to leverage innumerable\nunderutilized data from RSU. To verify the FedRSU framework, we construct a\nlarge-scale multi-modality dataset RSU-SF. The dataset consists of 17 RSU\nclients, covering various scenarios, modalities, and sensor settings. 
Based on\nRSU-SF, we show that FedRSU can greatly improve model performance in ITS and\nprovide a comprehensive benchmark under diverse FL scenarios. To the best of\nour knowledge, we provide the first real-world LiDAR-camera multi-modal dataset\nand benchmark for the FL community.\n","authors":["Shaoheng Fang","Rui Ye","Wenhao Wang","Zuhong Liu","Yuxiao Wang","Yafei Wang","Siheng Chen","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2401.12862v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.14391v4","updated":"2024-01-23T15:52:28Z","published":"2023-04-27T17:55:13Z","title":"Energy-based Models are Zero-Shot Planners for Compositional Scene\n Rearrangement","summary":" Language is compositional; an instruction can express multiple relation\nconstraints to hold among objects in a scene that a robot is tasked to\nrearrange. Our focus in this work is an instructable scene-rearranging\nframework that generalizes to longer instructions and to spatial concept\ncompositions never seen at training time. We propose to represent\nlanguage-instructed spatial concepts with energy functions over relative object\narrangements. A language parser maps instructions to corresponding energy\nfunctions and an open-vocabulary visual-language model grounds their arguments\nto relevant objects in the scene. We generate goal scene configurations by\ngradient descent on the sum of energy functions, one per language predicate in\nthe instruction. Local vision-based policies then re-locate objects to the\ninferred goal locations. We test our model on established instruction-guided\nmanipulation benchmarks, as well as benchmarks of compositional instructions we\nintroduce. We show our model can execute highly compositional instructions\nzero-shot in simulation and in the real world. It outperforms\nlanguage-to-action reactive policies and Large Language Model planners by a\nlarge margin, especially for long instructions that involve compositions of\nmultiple spatial concepts. Simulation and real-world robot execution videos, as\nwell as our code and datasets are publicly available on our website:\nhttps://ebmplanner.github.io.\n","authors":["Nikolaos Gkanatsios","Ayush Jain","Zhou Xian","Yunchu Zhang","Christopher Atkeson","Katerina Fragkiadaki"],"pdf_url":"https://arxiv.org/pdf/2304.14391v4.pdf","comment":"First two authors contributed equally | RSS 2023"},{"id":"http://arxiv.org/abs/2309.01141v4","updated":"2024-01-23T15:51:18Z","published":"2023-09-03T11:32:28Z","title":"VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual\n Grounders","summary":" Large-scale text-to-image diffusion models have shown impressive capabilities\nfor generative tasks by leveraging strong vision-language alignment from\npre-training. However, most vision-language discriminative tasks require\nextensive fine-tuning on carefully-labeled datasets to acquire such alignment,\nwith great cost in time and computing resources. In this work, we explore\ndirectly applying a pre-trained generative diffusion model to the challenging\ndiscriminative task of visual grounding without any fine-tuning and additional\ntraining dataset. Specifically, we propose VGDiffZero, a simple yet effective\nzero-shot visual grounding framework based on text-to-image diffusion models.\nWe also design a comprehensive region-scoring method considering both global\nand local contexts of each isolated proposal. Extensive experiments on RefCOCO,\nRefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on\nzero-shot visual grounding. 
Our code is available at\nhttps://github.com/xuyang-liu16/VGDiffZero.\n","authors":["Xuyang Liu","Siteng Huang","Yachen Kang","Honggang Chen","Donglin Wang"],"pdf_url":"https://arxiv.org/pdf/2309.01141v4.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.12851v1","updated":"2024-01-23T15:35:50Z","published":"2024-01-23T15:35:50Z","title":"Classification of grapevine varieties using UAV hyperspectral imaging","summary":" The classification of different grapevine varieties is a relevant phenotyping\ntask in Precision Viticulture since it enables estimating the growth of\nvineyard rows dedicated to different varieties, among other applications\nconcerning the wine industry. This task can be performed with destructive\nmethods that require time-consuming tasks, including data collection and\nanalysis in the laboratory. However, Unmanned Aerial Vehicles (UAV) provide a\nmore efficient and less prohibitive approach to collecting hyperspectral data,\ndespite acquiring noisier data. Therefore, the first task is the processing of\nthese data to correct and downsample large amounts of data. In addition, the\nhyperspectral signatures of grape varieties are very similar. In this work, a\nConvolutional Neural Network (CNN) is proposed for classifying seventeen\nvarieties of red and white grape variants. Rather than classifying single\nsamples, these are processed together with their neighbourhood. Hence, the\nextraction of spatial and spectral features is addressed with 1) a spatial\nattention layer and 2) Inception blocks. The pipeline goes from processing to\ndataset elaboration, finishing with the training phase. The fitted model is\nevaluated in terms of response time, accuracy and data separability, and\ncompared with other state-of-the-art CNNs for classifying hyperspectral data.\nOur network was proven to be much more lightweight with a reduced number of\ninput bands, a lower number of trainable weights and therefore, reduced\ntraining time. Despite this, the evaluated metrics showed much better results\nfor our network (~99% overall accuracy), in comparison with previous works\nbarely achieving 81% OA.\n","authors":["Alfonso López","Carlos Javier Ogayar","Francisco Ramón Feito","Joaquim João Sousa"],"pdf_url":"https://arxiv.org/pdf/2401.12851v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.01651v3","updated":"2024-01-23T15:31:17Z","published":"2024-01-03T10:08:40Z","title":"AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated\n by AI","summary":" The burgeoning field of Artificial Intelligence Generated Content (AIGC) is\nwitnessing rapid advancements, particularly in video generation. This paper\nintroduces AIGCBench, a pioneering comprehensive and scalable benchmark\ndesigned to evaluate a variety of video generation tasks, with a primary focus\non Image-to-Video (I2V) generation. AIGCBench tackles the limitations of\nexisting benchmarks, which suffer from a lack of diverse datasets, by including\na varied and open-domain image-text dataset that evaluates different\nstate-of-the-art algorithms under equivalent conditions. We employ a novel text\ncombiner and GPT-4 to create rich text prompts, which are then used to generate\nimages via advanced Text-to-Image models. To establish a unified evaluation\nframework for video generation tasks, our benchmark includes 11 metrics\nspanning four dimensions to assess algorithm performance. These dimensions are\ncontrol-video alignment, motion effects, temporal consistency, and video\nquality. 
These metrics are both reference video-dependent and video-free,\nensuring a comprehensive evaluation strategy. The evaluation standard proposed\ncorrelates well with human judgment, providing insights into the strengths and\nweaknesses of current I2V algorithms. The findings from our extensive\nexperiments aim to stimulate further research and development in the I2V field.\nAIGCBench represents a significant step toward creating standardized benchmarks\nfor the broader AIGC landscape, proposing an adaptable and equitable framework\nfor future assessments of video generation tasks. We have open-sourced the\ndataset and evaluation code on the project website:\nhttps://www.benchcouncil.org/AIGCBench.\n","authors":["Fanda Fan","Chunjie Luo","Wanling Gao","Jianfeng Zhan"],"pdf_url":"https://arxiv.org/pdf/2401.01651v3.pdf","comment":"Accepted to BenchCouncil Transactions on Benchmarks, Standards and\n Evaluations (TBench)"},{"id":"http://arxiv.org/abs/2401.12074v2","updated":"2024-01-23T15:23:03Z","published":"2024-01-22T16:14:26Z","title":"DeepCERES: A Deep learning method for cerebellar lobule segmentation\n using ultra-high resolution multimodal MRI","summary":" This paper introduces a novel multimodal and high-resolution human brain\ncerebellum lobule segmentation method. Unlike current tools that operate at\nstandard resolution ($1 \\text{ mm}^{3}$) or using mono-modal data, the proposed\nmethod improves cerebellum lobule segmentation through the use of a multimodal\nand ultra-high resolution ($0.125 \\text{ mm}^{3}$) training dataset. To develop\nthe method, first, a database of semi-automatically labelled cerebellum lobules\nwas created to train the proposed method with ultra-high resolution T1 and T2\nMR images. Then, an ensemble of deep networks has been designed and developed,\nallowing the proposed method to excel in the complex cerebellum lobule\nsegmentation task, improving precision while being memory efficient. Notably,\nour approach deviates from the traditional U-Net model by exploring alternative\narchitectures. We have also integrated deep learning with classical machine\nlearning methods incorporating a priori knowledge from multi-atlas\nsegmentation, which improved precision and robustness. Finally, a new online\npipeline, named DeepCERES, has been developed to make available the proposed\nmethod to the scientific community requiring as input only a single T1 MR image\nat standard resolution.\n","authors":["Sergio Morell-Ortega","Marina Ruiz-Perez","Marien Gadea","Roberto Vivo-Hernando","Gregorio Rubio","Fernando Aparici","Maria de la Iglesia-Vaya","Gwenaelle Catheline","Pierrick Coupé","José V. Manjón"],"pdf_url":"https://arxiv.org/pdf/2401.12074v2.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2310.00367v2","updated":"2024-01-23T15:20:33Z","published":"2023-09-30T13:15:49Z","title":"AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with\n TikZ","summary":" Generating bitmap graphics from text has gained considerable attention, yet\nfor scientific figures, vector graphics are often preferred. Given that vector\ngraphics are typically encoded using low-level graphics primitives, generating\nthem directly is difficult. To address this, we propose the use of TikZ, a\nwell-known abstract graphics language that can be compiled to vector graphics,\nas an intermediate representation of scientific figures. TikZ offers\nhuman-oriented, high-level commands, thereby facilitating conditional language\nmodeling with any large language model. 
To this end, we introduce DaTikZ, the\nfirst large-scale TikZ dataset consisting of 120k TikZ drawings aligned with\ncaptions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which\naugments LLaMA with multimodal CLIP embeddings. In both human and automatic\nevaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms\nof similarity to human-created figures, with CLiMA additionally improving\ntext-image alignment. Our detailed analysis shows that all models generalize\nwell and are not susceptible to memorization. GPT-4 and Claude 2, however, tend\nto generate more simplistic figures compared to both humans and our models. We\nmake our framework, AutomaTikZ, along with model weights and datasets, publicly\navailable.\n","authors":["Jonas Belouadi","Anne Lauscher","Steffen Eger"],"pdf_url":"https://arxiv.org/pdf/2310.00367v2.pdf","comment":"Accepted at ICLR 2024 (poster); Project Page:\n https://github.com/potamides/AutomaTikZ"},{"id":"http://arxiv.org/abs/2401.12835v1","updated":"2024-01-23T15:18:20Z","published":"2024-01-23T15:18:20Z","title":"SGTR+: End-to-end Scene Graph Generation with Transformer","summary":" Scene Graph Generation (SGG) remains a challenging visual understanding task\ndue to its compositional property. Most previous works adopt a bottom-up,\ntwo-stage or point-based, one-stage approach, which often suffers from high\ntime complexity or suboptimal designs. In this work, we propose a novel SGG\nmethod to address the aforementioned issues, formulating the task as a\nbipartite graph construction problem. To address the issues above, we create a\ntransformer-based end-to-end framework to generate the entity and entity-aware\npredicate proposal set, and infer directed edges to form relation triplets.\nMoreover, we design a graph assembling module to infer the connectivity of the\nbipartite scene graph based on our entity-aware structure, enabling us to\ngenerate the scene graph in an end-to-end manner. Based on bipartite graph\nassembling paradigm, we further propose a new technical design to address the\nefficacy of entity-aware modeling and optimization stability of graph\nassembling. Equipped with the enhanced entity-aware design, our method achieves\noptimal performance and time-complexity. Extensive experimental results show\nthat our design is able to achieve the state-of-the-art or comparable\nperformance on three challenging benchmarks, surpassing most of the existing\napproaches and enjoying higher efficiency in inference. Code is available:\nhttps://github.com/Scarecrow0/SGTR\n","authors":["Rongjie Li","Songyang Zhang","Xuming He"],"pdf_url":"https://arxiv.org/pdf/2401.12835v1.pdf","comment":"Accepted by TPAMI: https://ieeexplore.ieee.org/document/10315230"},{"id":"http://arxiv.org/abs/2401.12820v1","updated":"2024-01-23T14:53:32Z","published":"2024-01-23T14:53:32Z","title":"DatUS^2: Data-driven Unsupervised Semantic Segmentation with Pre-trained\n Self-supervised Vision Transformer","summary":" Successive proposals of several self-supervised training schemes continue to\nemerge, taking one step closer to developing a universal foundation model. In\nthis process, the unsupervised downstream tasks are recognized as one of the\nevaluation methods to validate the quality of visual features learned with a\nself-supervised training scheme. 
However, unsupervised dense semantic\nsegmentation has not been explored as a downstream task, which can utilize and\nevaluate the quality of semantic information introduced in patch-level feature\nrepresentations during self-supervised training of a vision transformer.\nTherefore, this paper proposes a novel data-driven approach for unsupervised\nsemantic segmentation (DatUS^2) as a downstream task. DatUS^2 generates\nsemantically consistent and dense pseudo-annotated segmentation masks for the\nunlabeled image dataset without using any visual-prior or synchronized data. We\ncompare these pseudo-annotated segmentation masks with ground truth masks for\nevaluating recent self-supervised training schemes to learn shared semantic\nproperties at the patch level and discriminative semantic properties at the\nsegment level. Finally, we evaluate existing state-of-the-art self-supervised\ntraining schemes with our proposed downstream task, i.e., DatUS^2. Also, the\nbest version of DatUS^2 outperforms the existing state-of-the-art method for\nthe unsupervised dense semantic segmentation task with 15.02% MiOU and 21.47%\nPixel accuracy on the SUIM dataset. It also achieves a competitive level of\naccuracy for a large-scale and complex dataset, i.e., the COCO dataset.\n","authors":["Sonal Kumar","Arijit Sur","Rashmi Dutta Baruah"],"pdf_url":"https://arxiv.org/pdf/2401.12820v1.pdf","comment":"The manuscript contains 13 pages, 9 figures and 7 tables"},{"id":"http://arxiv.org/abs/2308.14190v2","updated":"2024-01-23T14:51:41Z","published":"2023-08-27T19:43:43Z","title":"Score-Based Generative Models for PET Image Reconstruction","summary":" Score-based generative models have demonstrated highly promising results for\nmedical image reconstruction tasks in magnetic resonance imaging or computed\ntomography. However, their application to Positron Emission Tomography (PET) is\nstill largely unexplored. PET image reconstruction involves a variety of\nchallenges, including Poisson noise with high variance and a wide dynamic\nrange. To address these challenges, we propose several PET-specific adaptations\nof score-based generative models. The proposed framework is developed for both\n2D and 3D PET. In addition, we provide an extension to guided reconstruction\nusing magnetic resonance images. We validate the approach through extensive 2D\nand 3D $\\textit{in-silico}$ experiments with a model trained on\npatient-realistic data without lesions, and evaluate on data without lesions as\nwell as out-of-distribution data with lesions. This demonstrates the proposed\nmethod's robustness and significant potential for improved PET reconstruction.\n","authors":["Imraj RD Singh","Alexander Denker","Riccardo Barbano","Željko Kereta","Bangti Jin","Kris Thielemans","Peter Maass","Simon Arridge"],"pdf_url":"https://arxiv.org/pdf/2308.14190v2.pdf","comment":"Accepted for publication at the Journal of Machine Learning for\n Biomedical Imaging (MELBA) https://melba-journal.org/2024:001"},{"id":"http://arxiv.org/abs/2401.12761v1","updated":"2024-01-23T13:43:17Z","published":"2024-01-23T13:43:17Z","title":"MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under\n Uncertainty","summary":" Achieving level-5 driving automation in autonomous vehicles necessitates a\nrobust semantic visual perception system capable of parsing data from different\nsensors across diverse conditions. 
However, existing semantic perception\ndatasets often lack important non-camera modalities typically used in\nautonomous vehicles, or they do not exploit such modalities to aid and improve\nsemantic annotations in challenging conditions. To address this, we introduce\nMUSES, the MUlti-SEnsor Semantic perception dataset for driving in adverse\nconditions under increased uncertainty. MUSES includes synchronized multimodal\nrecordings with 2D panoptic annotations for 2500 images captured under diverse\nweather and illumination. The dataset integrates a frame camera, a lidar, a\nradar, an event camera, and an IMU/GNSS sensor. Our new two-stage panoptic\nannotation protocol captures both class-level and instance-level uncertainty in\nthe ground truth and enables the novel task of uncertainty-aware panoptic\nsegmentation we introduce, along with standard semantic and panoptic\nsegmentation. MUSES proves both effective for training and challenging for\nevaluating models under diverse visual conditions, and it opens new avenues for\nresearch in multimodal and uncertainty-aware dense semantic perception. Our\ndataset and benchmark will be made publicly available.\n","authors":["Tim Brödermann","David Bruggemann","Christos Sakaridis","Kevin Ta","Odysseas Liagouris","Jason Corkill","Luc Van Gool"],"pdf_url":"https://arxiv.org/pdf/2401.12761v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12751v1","updated":"2024-01-23T13:30:43Z","published":"2024-01-23T13:30:43Z","title":"PSDF: Prior-Driven Neural Implicit Surface Learning for Multi-view\n Reconstruction","summary":" Surface reconstruction has traditionally relied on the Multi-View Stereo\n(MVS)-based pipeline, which often suffers from noisy and incomplete geometry.\nThis is due to that although MVS has been proven to be an effective way to\nrecover the geometry of the scenes, especially for locally detailed areas with\nrich textures, it struggles to deal with areas with low texture and large\nvariations of illumination where the photometric consistency is unreliable.\nRecently, Neural Implicit Surface Reconstruction (NISR) combines surface\nrendering and volume rendering techniques and bypasses the MVS as an\nintermediate step, which has emerged as a promising alternative to overcome the\nlimitations of traditional pipelines. While NISR has shown impressive results\non simple scenes, it remains challenging to recover delicate geometry from\nuncontrolled real-world scenes which is caused by its underconstrained\noptimization. To this end, the framework PSDF is proposed which resorts to\nexternal geometric priors from a pretrained MVS network and internal geometric\npriors inherent in the NISR model to facilitate high-quality neural implicit\nsurface learning. Specifically, the visibility-aware feature consistency loss\nand depth prior-assisted sampling based on external geometric priors are\nintroduced. These proposals provide powerfully geometric consistency\nconstraints and aid in locating surface intersection points, thereby\nsignificantly improving the accuracy and delicate reconstruction of NISR.\nMeanwhile, the internal prior-guided importance rendering is presented to\nenhance the fidelity of the reconstructed surface mesh by mitigating the biased\nrendering issue in NISR. 
Extensive experiments on the Tanks and Temples dataset\nshow that PSDF achieves state-of-the-art performance on complex uncontrolled\nscenes.\n","authors":["Wanjuan Su","Chen Zhang","Qingshan Xu","Wenbing Tao"],"pdf_url":"https://arxiv.org/pdf/2401.12751v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12743v1","updated":"2024-01-23T13:20:57Z","published":"2024-01-23T13:20:57Z","title":"Correlation-Embedded Transformer Tracking: A Single-Branch Framework","summary":" Developing robust and discriminative appearance models has been a\nlong-standing research challenge in visual object tracking. In the prevalent\nSiamese-based paradigm, the features extracted by the Siamese-like networks are\noften insufficient to model the tracked targets and distractor objects, thereby\nhindering them from being robust and discriminative simultaneously. While most\nSiamese trackers focus on designing robust correlation operations, we propose a\nnovel single-branch tracking framework inspired by the transformer. Unlike the\nSiamese-like feature extraction, our tracker deeply embeds cross-image feature\ncorrelation in multiple layers of the feature network. By extensively matching\nthe features of the two images through multiple layers, it can suppress\nnon-target features, resulting in target-aware feature extraction. The output\nfeatures can be directly used for predicting target locations without\nadditional correlation steps. Thus, we reformulate the two-branch Siamese\ntracking as a conceptually simple, fully transformer-based Single-Branch\nTracking pipeline, dubbed SBT. After conducting an in-depth analysis of the SBT\nbaseline, we summarize many effective design principles and propose an improved\ntracker dubbed SuperSBT. SuperSBT adopts a hierarchical architecture with a\nlocal modeling layer to enhance shallow-level features. A unified relation\nmodeling is proposed to remove complex handcrafted layer pattern designs.\nSuperSBT is further improved by masked image modeling pre-training, integrating\ntemporal modeling, and equipping with dedicated prediction heads. Thus,\nSuperSBT outperforms the SBT baseline by 4.7%, 3.0%, and 4.5% AUC scores in\nLaSOT, TrackingNet, and GOT-10K. Notably, SuperSBT greatly raises the speed of\nSBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves\nsuperior results on eight VOT benchmarks.\n","authors":["Fei Xie","Wankou Yang","Chunyu Wang","Lei Chu","Yue Cao","Chao Ma","Wenjun Zeng"],"pdf_url":"https://arxiv.org/pdf/2401.12743v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2307.03212v2","updated":"2024-01-23T13:15:31Z","published":"2023-07-06T16:38:43Z","title":"Region-Wise Attentive Multi-View Representation Learning for Urban\n Region Embeddings","summary":" Urban region embedding is an important and yet highly challenging issue due\nto the complexity and constantly changing nature of urban data. To address the\nchallenges, we propose a Region-Wise Multi-View Representation Learning (ROMER)\nto capture multi-view dependencies and learn expressive representations of\nurban regions without the constraints of rigid neighbourhood region conditions.\nOur model focuses on learning urban region representations from multi-source\nurban data. First, we capture the multi-view correlations from mobility flow\npatterns, POI semantics and check-in dynamics. Then, we adopt global graph\nattention networks to learn the similarity of any two vertices in graphs. 
To\ncomprehensively consider and share features of multiple views, a two-stage\nfusion module is further proposed to learn weights with external attention to\nfuse multi-view embeddings. Extensive experiments for two downstream tasks on\nreal-world datasets demonstrate that our model outperforms state-of-the-art\nmethods by up to 17\\% improvement.\n","authors":["Weiliang Chan","Qianqian Ren"],"pdf_url":"https://arxiv.org/pdf/2307.03212v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12736v1","updated":"2024-01-23T13:13:45Z","published":"2024-01-23T13:13:45Z","title":"Shift-ConvNets: Small Convolutional Kernel with Large Kernel Effects","summary":" Recent studies reveal that the remarkable performance of Vision transformers\n(ViTs) benefits from large receptive fields. For this reason, the large\nconvolutional kernel design becomes an ideal solution to make Convolutional\nNeural Networks (CNNs) great again. However, the typical large convolutional\nkernels turn out to be hardware-unfriendly operators, resulting in discount\ncompatibility of various hardware platforms. Thus, it is unwise to simply\nenlarge the convolutional kernel size. In this paper, we reveal that small\nconvolutional kernels and convolution operations can achieve the closing\neffects of large kernel sizes. Then, we propose a shift-wise operator that\nensures the CNNs capture long-range dependencies with the help of the sparse\nmechanism, while remaining hardware-friendly. Experimental results show that\nour shift-wise operator significantly improves the accuracy of a regular CNN\nwhile markedly reducing computational requirements. On the ImageNet-1k, our\nshift-wise enhanced CNN model outperforms the state-of-the-art models. Code &\nmodels at https://github.com/lidc54/shift-wiseConv.\n","authors":["Dachong Li","Li Li","Zhuangzhuang Chen","Jianqiang Li"],"pdf_url":"https://arxiv.org/pdf/2401.12736v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12729v1","updated":"2024-01-23T13:02:11Z","published":"2024-01-23T13:02:11Z","title":"Enhancing Object Detection Performance for Small Objects through\n Synthetic Data Generation and Proportional Class-Balancing Technique: A\n Comparative Study in Industrial Scenarios","summary":" Object Detection (OD) has proven to be a significant computer vision method\nin extracting localized class information and has multiple applications in the\nindustry. Although many of the state-of-the-art (SOTA) OD models perform well\non medium and large sized objects, they seem to under perform on small objects.\nIn most of the industrial use cases, it is difficult to collect and annotate\ndata for small objects, as it is time-consuming and prone to human errors.\nAdditionally, those datasets are likely to be unbalanced and often result in an\ninefficient model convergence. To tackle this challenge, this study presents a\nnovel approach that injects additional data points to improve the performance\nof the OD models. Using synthetic data generation, the difficulties in data\ncollection and annotations for small object data points can be minimized and to\ncreate a dataset with balanced distribution. This paper discusses the effects\nof a simple proportional class-balancing technique, to enable better anchor\nmatching of the OD models. 
A comparison was carried out on the performances of\nthe SOTA OD models: YOLOv5, YOLOv7 and SSD, for combinations of real and\nsynthetic datasets within an industrial use case.\n","authors":["Jibinraj Antony","Vinit Hegiste","Ali Nazeri","Hooman Tavakoli","Snehal Walunj","Christiane Plociennik","Martin Ruskowski"],"pdf_url":"https://arxiv.org/pdf/2401.12729v1.pdf","comment":"Accepted and presented in conference ESAIM23 1st European Symposium\n on Artificial Intelligence in Manufacturing"},{"id":"http://arxiv.org/abs/2401.12725v1","updated":"2024-01-23T12:53:37Z","published":"2024-01-23T12:53:37Z","title":"Two-View Topogram-Based Anatomy-Guided CT Reconstruction for Prospective\n Risk Minimization","summary":" To facilitate a prospective estimation of CT effective dose and risk\nminimization process, a prospective spatial dose estimation and the known\nanatomical structures are expected. To this end, a CT reconstruction method is\nrequired to reconstruct CT volumes from as few projections as possible, i.e. by\nusing the topograms, with anatomical structures as correct as possible. In this\nwork, an optimized CT reconstruction model based on a generative adversarial\nnetwork (GAN) is proposed. The GAN is trained to reconstruct 3D volumes from an\nanterior-posterior and a lateral CT projection. To enhance anatomical\nstructures, a pre-trained organ segmentation network and the 3D perceptual loss\nare applied during the training phase, so that the model can then generate both\norgan-enhanced CT volume and the organ segmentation mask. The proposed method\ncan reconstruct CT volumes with PSNR of 26.49, RMSE of 196.17, and SSIM of\n0.64, compared to 26.21, 201.55 and 0.63 using the baseline method. In terms of\nthe anatomical structure, the proposed method effectively enhances the organ\nshape and boundary and allows for a straight-forward identification of the\nrelevant anatomical structures. We note that conventional reconstruction\nmetrics fail to indicate the enhancement of anatomical structures. In addition\nto such metrics, the evaluation is expanded with assessing the organ\nsegmentation performance. The average organ dice of the proposed method is 0.71\ncompared with 0.63 in baseline model, indicating the enhancement of anatomical\nstructures.\n","authors":["Chang Liu","Laura Klein","Yixing Huang","Edith Baader","Michael Lell","Marc Kachelrieß","Andreas Maier"],"pdf_url":"https://arxiv.org/pdf/2401.12725v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12694v1","updated":"2024-01-23T11:58:08Z","published":"2024-01-23T11:58:08Z","title":"Pragmatic Communication in Multi-Agent Collaborative Perception","summary":" Collaborative perception allows each agent to enhance its perceptual\nabilities by exchanging messages with others. It inherently results in a\ntrade-off between perception ability and communication costs. Previous works\ntransmit complete full-frame high-dimensional feature maps among agents,\nresulting in substantial communication costs. To promote communication\nefficiency, we propose only transmitting the information needed for the\ncollaborator's downstream task. 
This pragmatic communication strategy focuses\non three key aspects: i) pragmatic message selection, which selects\ntask-critical parts from the complete data, resulting in spatially and\ntemporally sparse feature vectors; ii) pragmatic message representation, which\nachieves pragmatic approximation of high-dimensional feature vectors with a\ntask-adaptive dictionary, enabling communicating with integer indices; iii)\npragmatic collaborator selection, which identifies beneficial collaborators,\npruning unnecessary communication links. Following this strategy, we first\nformulate a mathematical optimization framework for the\nperception-communication trade-off and then propose PragComm, a multi-agent\ncollaborative perception system with two key components: i) single-agent\ndetection and tracking and ii) pragmatic collaboration. The proposed PragComm\npromotes pragmatic communication and adapts to a wide range of communication\nconditions. We evaluate PragComm for both collaborative 3D object detection and\ntracking tasks in both real-world, V2V4Real, and simulation datasets, OPV2V and\nV2X-SIM2.0. PragComm consistently outperforms previous methods with more than\n32.7K times lower communication volume on OPV2V. Code is available at\ngithub.com/PhyllisH/PragComm.\n","authors":["Yue Hu","Xianghe Pang","Xiaoqi Qin","Yonina C. Eldar","Siheng Chen","Ping Zhang","Wenjun Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.12694v1.pdf","comment":"18 pages"},{"id":"http://arxiv.org/abs/2401.12689v1","updated":"2024-01-23T11:54:09Z","published":"2024-01-23T11:54:09Z","title":"Energy-based Automated Model Evaluation","summary":" The conventional evaluation protocols on machine learning models rely heavily\non a labeled, i.i.d-assumed testing dataset, which is not often present in real\nworld applications. The Automated Model Evaluation (AutoEval) shows an\nalternative to this traditional workflow, by forming a proximal prediction\npipeline of the testing performance without the presence of ground-truth\nlabels. Despite its recent successes, the AutoEval frameworks still suffer from\nan overconfidence issue, substantial storage and computational cost. In that\nregard, we propose a novel measure -- Meta-Distribution Energy (MDE) -- that\nallows the AutoEval framework to be both more efficient and effective. The core\nof the MDE is to establish a meta-distribution statistic, on the information\n(energy) associated with individual samples, then offer a smoother\nrepresentation enabled by energy-based learning. We further provide our\ntheoretical insights by connecting the MDE with the classification loss. We\nprovide extensive experiments across modalities, datasets and different\narchitectural backbones to validate MDE's validity, together with its\nsuperiority compared with prior approaches. We also prove MDE's versatility by\nshowing its seamless integration with large-scale models, and easy adaption to\nlearning scenarios with noisy- or imbalanced- labels.\n","authors":["Ru Peng","Heming Zou","Haobo Wang","Yawen Zeng","Zenan Huang","Junbo Zhao"],"pdf_url":"https://arxiv.org/pdf/2401.12689v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.02246v3","updated":"2024-01-23T11:26:42Z","published":"2023-12-04T14:45:56Z","title":"Conditional Variational Diffusion Models","summary":" Inverse problems aim to determine parameters from observations, a crucial\ntask in engineering and science. 
Lately, generative models, especially\ndiffusion models, have gained popularity in this area for their ability to\nproduce realistic solutions and their good mathematical properties. Despite\ntheir success, an important drawback of diffusion models is their sensitivity\nto the choice of variance schedule, which controls the dynamics of the\ndiffusion process. Fine-tuning this schedule for specific applications is\ncrucial but time-costly and does not guarantee an optimal result. We propose a\nnovel approach for learning the schedule as part of the training process. Our\nmethod supports probabilistic conditioning on data, provides high-quality\nsolutions, and is flexible, proving able to adapt to different applications\nwith minimum overhead. This approach is tested in two unrelated inverse\nproblems: super-resolution microscopy and quantitative phase imaging, yielding\ncomparable or superior results to previous methods and fine-tuned diffusion\nmodels. We conclude that fine-tuning the schedule by experimentation should be\navoided because it can be learned during training in a stable way that yields\nbetter results.\n","authors":["Gabriel della Maggiora","Luis Alberto Croquevielle","Nikita Deshpande","Harry Horsley","Thomas Heinis","Artur Yakimovich"],"pdf_url":"https://arxiv.org/pdf/2312.02246v3.pdf","comment":"Denoising Diffusion Probabilistic Models, Inverse Problems,\n Generative Models, Super Resolution, Phase Quantification, Variational\n Methods"},{"id":"http://arxiv.org/abs/2401.07709v2","updated":"2024-01-23T11:22:03Z","published":"2024-01-15T14:25:54Z","title":"Towards Efficient Diffusion-Based Image Editing with Instant Attention\n Masks","summary":" Diffusion-based Image Editing (DIE) is an emerging research hot-spot, which\noften applies a semantic mask to control the target area for diffusion-based\nediting. However, most existing solutions obtain these masks via manual\noperations or off-line processing, greatly reducing their efficiency. In this\npaper, we propose a novel and efficient image editing method for Text-to-Image\n(T2I) diffusion models, termed Instant Diffusion Editing(InstDiffEdit). In\nparticular, InstDiffEdit aims to employ the cross-modal attention ability of\nexisting diffusion models to achieve instant mask guidance during the diffusion\nsteps. To reduce the noise of attention maps and realize the full automatics,\nwe equip InstDiffEdit with a training-free refinement scheme to adaptively\naggregate the attention distributions for the automatic yet accurate mask\ngeneration. Meanwhile, to supplement the existing evaluations of DIE, we\npropose a new benchmark called Editing-Mask to examine the mask accuracy and\nlocal editing ability of existing methods. To validate InstDiffEdit, we also\nconduct extensive experiments on ImageNet and Imagen, and compare it with a\nbunch of the SOTA methods. 
The experimental results show that InstDiffEdit not\nonly outperforms the SOTA methods in both image quality and editing results,\nbut also has a much faster inference speed, i.e., +5 to +6 times.\n","authors":["Siyu Zou","Jiji Tang","Yiyi Zhou","Jing He","Chaoyi Zhao","Rongsheng Zhang","Zhipeng Hu","Xiaoshuai Sun"],"pdf_url":"https://arxiv.org/pdf/2401.07709v2.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2401.12665v1","updated":"2024-01-23T11:20:03Z","published":"2024-01-23T11:20:03Z","title":"ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation","summary":" Recently, foundational models such as CLIP and SAM have shown promising\nperformance for the task of Zero-Shot Anomaly Segmentation (ZSAS). However,\neither CLIP-based or SAM-based ZSAS methods still suffer from non-negligible\nkey drawbacks: 1) CLIP primarily focuses on global feature alignment across\ndifferent inputs, leading to imprecise segmentation of local anomalous parts;\n2) SAM tends to generate numerous redundant masks without proper prompt\nconstraints, resulting in complex post-processing requirements. In this work,\nwe innovatively propose a CLIP and SAM collaboration framework called ClipSAM\nfor ZSAS. The insight behind ClipSAM is to employ CLIP's semantic understanding\ncapability for anomaly localization and rough segmentation, which is further\nused as the prompt constraints for SAM to refine the anomaly segmentation\nresults. In details, we introduce a crucial Unified Multi-scale Cross-modal\nInteraction (UMCI) module for interacting language with visual features at\nmultiple scales of CLIP to reason anomaly positions. Then, we design a novel\nMulti-level Mask Refinement (MMR) module, which utilizes the positional\ninformation as multi-level prompts for SAM to acquire hierarchical levels of\nmasks and merges them. Extensive experiments validate the effectiveness of our\napproach, achieving the optimal segmentation performance on the MVTec-AD and\nVisA datasets.\n","authors":["Shengze Li","Jianjian Cao","Peng Ye","Yuhan Ding","Chongjun Tu","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2401.12665v1.pdf","comment":"7 pages,6 figures"},{"id":"http://arxiv.org/abs/2401.12648v1","updated":"2024-01-23T10:56:01Z","published":"2024-01-23T10:56:01Z","title":"Consistency Enhancement-Based Deep Multiview Clustering via Contrastive\n Learning","summary":" Multiview clustering (MVC) segregates data samples into meaningful clusters\nby synthesizing information across multiple views. Moreover, deep\nlearning-based methods have demonstrated their strong feature learning\ncapabilities in MVC scenarios. However, effectively generalizing feature\nrepresentations while maintaining consistency is still an intractable problem.\nIn addition, most existing deep clustering methods based on contrastive\nlearning overlook the consistency of the clustering representations during the\nclustering process. In this paper, we show how the above problems can be\novercome and propose a consistent enhancement-based deep MVC method via\ncontrastive learning (CCEC). Specifically, semantic connection blocks are\nincorporated into a feature representation to preserve the consistent\ninformation among multiple views. Furthermore, the representation process for\nclustering is enhanced through spectral clustering, and the consistency across\nmultiple views is improved. 
Experiments conducted on five datasets demonstrate\nthe effectiveness and superiority of our method in comparison with the\nstate-of-the-art (SOTA) methods. The code for this method can be accessed at\nhttps://anonymous.4open.science/r/CCEC-E84E/.\n","authors":["Hao Yang","Hua Mao","Wai Lok Woo","Jie Chen","Xi Peng"],"pdf_url":"https://arxiv.org/pdf/2401.12648v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12609v1","updated":"2024-01-23T10:07:41Z","published":"2024-01-23T10:07:41Z","title":"Fast Semi-supervised Unmixing using Non-convex Optimization","summary":" In this paper, we introduce a novel linear model tailored for\nsemisupervised/library-based unmixing. Our model incorporates considerations\nfor library mismatch while enabling the enforcement of the abundance sum-to-one\nconstraint (ASC). Unlike conventional sparse unmixing methods, this model\ninvolves nonconvex optimization, presenting significant computational\nchallenges. We demonstrate the efficacy of Alternating Methods of Multipliers\n(ADMM) in cyclically solving these intricate problems. We propose two\nsemisupervised unmixing approaches, each relying on distinct priors applied to\nthe new model in addition to the ASC: sparsity prior and convexity constraint.\nOur experimental results validate that enforcing the convexity constraint\noutperforms the sparsity prior for the endmember library. These results are\ncorroborated across three simulated datasets (accounting for spectral\nvariability and varying pixel purity levels) and the Cuprite dataset.\nAdditionally, our comparison with conventional sparse unmixing methods\nshowcases considerable advantages of our proposed model, which entails\nnonconvex optimization. Notably, our implementations of the proposed\nalgorithms-fast semisupervised unmixing (FaSUn) and sparse unmixing using\nsoft-shrinkage (SUnS)-prove considerably more efficient than traditional sparse\nunmixing methods. SUnS and FaSUn were implemented using PyTorch and provided in\na dedicated Python package called Fast Semisupervised Unmixing (FUnmix), which\nis open-source and available at https://github.com/BehnoodRasti/FUnmix\n","authors":["Behnood Rasti","Alexandre Zouaoui","Julien Mairal","Jocelyn Chanussot"],"pdf_url":"https://arxiv.org/pdf/2401.12609v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12596v1","updated":"2024-01-23T09:49:24Z","published":"2024-01-23T09:49:24Z","title":"UniHDA: Towards Universal Hybrid Domain Adaptation of Image Generators","summary":" Generative domain adaptation has achieved remarkable progress, enabling us to\nadapt a pre-trained generator to a new target domain. However, existing methods\nsimply adapt the generator to a single target domain and are limited to a\nsingle modality, either text-driven or image-driven. Moreover, they are prone\nto overfitting domain-specific attributes, which inevitably compromises\ncross-domain consistency. In this paper, we propose UniHDA, a unified and\nversatile framework for generative hybrid domain adaptation with multi-modal\nreferences from multiple domains. We use CLIP encoder to project multi-modal\nreferences into a unified embedding space and then linear interpolate the\ndirection vectors from multiple target domains to achieve hybrid domain\nadaptation. To ensure the cross-domain consistency, we propose a novel\ncross-domain spatial structure (CSS) loss that maintains detailed spatial\nstructure information between source and target generator. 
Experiments show\nthat the adapted generator can synthesise realistic images with various\nattribute compositions. Additionally, our framework generalizes to multiple\ngenerators, e.g., StyleGAN2 and Diffusion Models.\n","authors":["Hengjia Li","Yang Liu","Yuqi Lin","Zhanwei Zhang","Yibo Zhao","weihang Pan","Tu Zheng","Zheng Yang","Yuchun Jiang","Boxi Wu","Deng Cai"],"pdf_url":"https://arxiv.org/pdf/2401.12596v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.08673v2","updated":"2024-01-23T09:47:26Z","published":"2023-12-14T06:17:15Z","title":"Segment Beyond View: Handling Partially Missing Modality for\n Audio-Visual Semantic Segmentation","summary":" Augmented Reality (AR) devices, emerging as prominent mobile interaction\nplatforms, face challenges in user safety, particularly concerning oncoming\nvehicles. While some solutions leverage onboard camera arrays, these cameras\noften have limited field-of-view (FoV) with front or downward perspectives.\nAddressing this, we propose a new out-of-view semantic segmentation task and\nSegment Beyond View (SBV), a novel audio-visual semantic segmentation method.\nSBV supplements the visual modality, which misses information beyond the FoV,\nwith auditory information using a teacher-student distillation model\n(Omni2Ego). The model consists of a vision teacher utilising panoramic\ninformation, an auditory teacher with 8-channel audio, and an audio-visual\nstudent that takes views with limited FoV and binaural audio as input and\nproduces semantic segmentation for objects outside the FoV. SBV outperforms existing\nmodels in comparative evaluations and shows consistent performance across\nvarying FoV ranges and in monaural audio settings.\n","authors":["Renjie Wu","Hu Wang","Feras Dayoub","Hsiang-Ting Chen"],"pdf_url":"https://arxiv.org/pdf/2312.08673v2.pdf","comment":"Accepted by AAAI-24"},{"id":"http://arxiv.org/abs/2401.12592v1","updated":"2024-01-23T09:47:13Z","published":"2024-01-23T09:47:13Z","title":"RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from\n RGB-D Videos","summary":" We introduce a new RGB-D object dataset captured in the wild called\nWildRGB-D. Unlike most existing real-world object-centric datasets which only\ncome with RGB capturing, the direct capture of the depth channel allows better\n3D annotations and broader downstream applications. WildRGB-D comprises\nlarge-scale category-level RGB-D object videos, which are taken using an iPhone\nto go around the objects in 360 degrees. It contains around 8500 recorded\nobjects and nearly 20000 RGB-D videos across 46 common object categories. These\nvideos are taken with diverse cluttered backgrounds and three setups to cover\nas many real-world scenarios as possible: (i) a single object in one video;\n(ii) multiple objects in one video; and (iii) an object with a static hand in\none video. The dataset is annotated with object masks, real-world scale camera\nposes, and reconstructed aggregated point clouds from RGB-D videos. We benchmark\nfour tasks with WildRGB-D including novel view synthesis, camera pose\nestimation, object 6D pose estimation, and object surface reconstruction. Our\nexperiments show that the large-scale capture of RGB-D objects provides great\npotential to advance 3D object learning. 
Our project page is\nhttps://wildrgbd.github.io/.\n","authors":["Hongchi Xia","Yang Fu","Sifei Liu","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2401.12592v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08276v2","updated":"2024-01-23T09:42:41Z","published":"2023-10-12T12:28:47Z","title":"Direction-Oriented Visual-semantic Embedding Model for Remote Sensing\n Image-text Retrieval","summary":" Image-text retrieval has developed rapidly in recent years. However, it is\nstill a challenge in remote sensing due to visual-semantic imbalance, which\nleads to incorrect matching of non-semantic visual and textual features. To\nsolve this problem, we propose a novel Direction-Oriented Visual-semantic\nEmbedding Model (DOVE) to mine the relationship between vision and language.\nOur highlight is to conduct visual and textual representations in latent space,\ndirecting them as close as possible to a redundancy-free regional visual\nrepresentation. Concretely, a Regional-Oriented Attention Module (ROAM)\nadaptively adjusts the distance between the final visual and textual embeddings\nin the latent semantic space, oriented by regional visual features. Meanwhile,\na lightweight Digging Text Genome Assistant (DTGA) is designed to expand the\nrange of tractable textual representation and enhance global word-level\nsemantic connections using less attention operations. Ultimately, we exploit a\nglobal visual-semantic constraint to reduce single visual dependency and serve\nas an external constraint for the final visual and textual representations. The\neffectiveness and superiority of our method are verified by extensive\nexperiments including parameter evaluation, quantitative comparison, ablation\nstudies and visual analysis, on two benchmark datasets, RSICD and RSITMD.\n","authors":["Qing Ma","Jiancheng Pan","Cong Bai"],"pdf_url":"https://arxiv.org/pdf/2310.08276v2.pdf","comment":"14 pages, 11 figures"},{"id":"http://arxiv.org/abs/2401.12587v1","updated":"2024-01-23T09:37:58Z","published":"2024-01-23T09:37:58Z","title":"Fast Implicit Neural Representation Image Codec in Resource-limited\n Devices","summary":" Displaying high-quality images on edge devices, such as augmented reality\ndevices, is essential for enhancing the user experience. However, these devices\noften face power consumption and computing resource limitations, making it\nchallenging to apply many deep learning-based image compression algorithms in\nthis field. Implicit Neural Representation (INR) for image compression is an\nemerging technology that offers two key benefits compared to cutting-edge\nautoencoder models: low computational complexity and parameter-free decoding.\nIt also outperforms many traditional and early neural compression methods in\nterms of quality. In this study, we introduce a new Mixed Autoregressive Model\n(MARM) to significantly reduce the decoding time for the current INR codec,\nalong with a new synthesis network to enhance reconstruction quality. MARM\nincludes our proposed Autoregressive Upsampler (ARU) blocks, which are highly\ncomputationally efficient, and ARM from previous work to balance decoding time\nand reconstruction quality. We also propose enhancing ARU's performance using a\ncheckerboard two-stage decoding strategy. Moreover, the ratio of different\nmodules can be adjusted to maintain a balance between quality and speed.\nComprehensive experiments demonstrate that our method significantly improves\ncomputational efficiency while preserving image quality. 
With different\nparameter settings, our method can outperform popular AE-based codecs in\nconstrained environments in terms of both quality and decoding time, or achieve\nstate-of-the-art reconstruction quality compared to other INR codecs.\n","authors":["Xiang Liu","Jiahong Chen","Bin Chen","Zimo Liu","Baoyi An","Shu-Tao Xia"],"pdf_url":"https://arxiv.org/pdf/2401.12587v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.06934v2","updated":"2024-01-23T09:23:40Z","published":"2023-12-12T02:10:16Z","title":"Toward Real Text Manipulation Detection: New Dataset and New Solution","summary":" With the surge in realistic text tampering, detecting fraudulent text in\nimages has gained prominence for maintaining information security. However, the\nhigh costs associated with professional text manipulation and annotation limit\nthe availability of real-world datasets, with most relying on synthetic\ntampering, which inadequately replicates real-world tampering attributes. To\naddress this issue, we present the Real Text Manipulation (RTM) dataset,\nencompassing 14,250 text images, which include 5,986 manually and 5,258\nautomatically tampered images, created using a variety of techniques, alongside\n3,006 unaltered text images for evaluating solution stability. Our evaluations\nindicate that existing methods falter in text forgery detection on the RTM\ndataset. We propose a robust baseline solution featuring a Consistency-aware\nAggregation Hub and a Gated Cross Neighborhood-attention Fusion module for\nefficient multi-modal information fusion, supplemented by a Tampered-Authentic\nContrastive Learning module during training, enriching feature representation\ndistinction. This framework, extendable to other dual-stream architectures,\ndemonstrated notable localization performance improvements of 7.33% and 6.38%\non manual and overall manipulations, respectively. Our contributions aim to\npropel advancements in real-world text tampering detection. Code and dataset\nwill be made available at https://github.com/DrLuo/RTM\n","authors":["Dongliang Luo","Yuliang Liu","Rui Yang","Xianjin Liu","Jishen Zeng","Yu Zhou","Xiang Bai"],"pdf_url":"https://arxiv.org/pdf/2312.06934v2.pdf","comment":"The paper needs to be improved"},{"id":"http://arxiv.org/abs/2306.06075v2","updated":"2024-01-23T09:06:46Z","published":"2023-05-26T13:41:35Z","title":"DeepSeaNet: Improving Underwater Object Detection using EfficientDet","summary":" Marine animals and deep underwater objects are difficult to recognize and\nmonitor for safety of aquatic life. There is an increasing challenge when the\nwater is saline with granular particles and impurities. In such natural\nadversarial environment, traditional approaches like CNN start to fail and are\nexpensive to compute. This project involves implementing and evaluating various\nobject detection models, including EfficientDet, YOLOv5, YOLOv8, and\nDetectron2, on an existing annotated underwater dataset, called the\nBrackish-Dataset. The dataset comprises annotated image sequences of fish,\ncrabs, starfish, and other aquatic animals captured in Limfjorden water with\nlimited visibility. The aim of this research project is to study the efficiency\nof newer models on the same dataset and contrast them with the previous results\nbased on accuracy and inference time. Firstly, I compare the results of YOLOv3\n(31.10% mean Average Precision (mAP)), YOLOv4 (83.72% mAP), YOLOv5 (97.6%),\nYOLOv8 (98.20%), EfficientDet (98.56% mAP) and Detectron2 (95.20% mAP) on the\nsame dataset. 
Secondly, I provide a modified BiSkFPN mechanism (BiFPN neck with\nskip connections) to perform complex feature fusion in adversarial noise, which\nmakes the modified EfficientDet robust to perturbations. Third, I analyze the effect\nof adversarial learning on the accuracy of EfficientDet (98.63% mAP) and YOLOv5\n(98.04% mAP). Last, I provide class activation map (CAM) based explanations for\nthe two models to promote explainability in black-box models. Overall, the\nresults indicate that the modified EfficientDet achieves higher accuracy than the\nother models under five-fold cross-validation, with 88.54% IoU of feature\nmaps.\n","authors":["Sanyam Jain"],"pdf_url":"https://arxiv.org/pdf/2306.06075v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.15939v3","updated":"2024-01-23T08:54:59Z","published":"2023-11-27T15:46:47Z","title":"Unleashing the Power of Prompt-driven Nucleus Instance Segmentation","summary":" Nucleus instance segmentation in histology images is crucial for a broad\nspectrum of clinical applications. Current dominant algorithms rely on\nregression of nuclear proxy maps. Distinguishing nucleus instances from the\nestimated maps requires carefully curated post-processing, which is error-prone\nand parameter-sensitive. Recently, the Segment Anything Model (SAM) has earned\nhuge attention in medical image segmentation, owing to its impressive\ngeneralization ability and promptable property. Nevertheless, its potential for\nnucleus instance segmentation remains largely underexplored. In this paper, we\npresent a novel prompt-driven framework that consists of a nucleus prompter and\nSAM for automatic nucleus instance segmentation. Specifically, the prompter\nlearns to generate a unique point prompt for each nucleus while the SAM is\nfine-tuned to output the corresponding mask for the prompted nucleus.\nFurthermore, we propose the inclusion of adjacent nuclei as negative prompts to\nenhance the model's capability to identify overlapping nuclei. Without\ncomplicated post-processing, our proposed method sets a new state-of-the-art\nperformance on three challenging benchmarks. Code is available at\ngithub.com/windygoo/PromptNucSeg\n","authors":["Zhongyi Shui","Yunlong Zhang","Kai Yao","Chenglu Zhu","Sunyi Zheng","Jingxiong Li","Honglin Li","Yuxuan Sun","Ruizhe Guo","Lin Yang"],"pdf_url":"https://arxiv.org/pdf/2311.15939v3.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2401.06827v2","updated":"2024-01-23T08:54:15Z","published":"2024-01-12T04:54:01Z","title":"APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning","summary":" Pre-trained Vision-Language (V-L) models set the benchmark for generalization\nto downstream tasks among the noteworthy contenders. Many characteristics of\nthe V-L model have been explored in existing research including the challenge\nof sensitivity to text input and the tuning process across multi-modal\nprompts. With the advanced utilization of the V-L model like CLIP, recent\napproaches deploy learnable prompts instead of hand-crafted prompts to boost the\ngeneralization performance and address the aforementioned challenges. Inspired\nby layer-wise training, which is widely used in image fusion, we note that\nusing a sequential training process to adapt the different modality branches of\nCLIP efficiently facilitates the improvement of generalization. 
In the context\nof addressing the multi-modal prompting challenge, we propose Token-wise\nAdaptive for Multi-modal Prompt Learning (APLe) for tuning both modalities\nprompts, vision and language, as tokens in a sequential manner. APLe addresses\nthe challenges in V-L models to promote prompt learning across both modalities,\nwhich indicates a competitive generalization performance in line with the\nstate-of-the-art. Preeminently, APLe shows robustness and favourable\nperformance in prompt-length experiments with an absolute advantage in adopting\nthe V-L models.\n","authors":["Guiming Cao","Kaize Shi","Hong Fu","Huaiwen Zhang","Guandong Xu"],"pdf_url":"https://arxiv.org/pdf/2401.06827v2.pdf","comment":"7 pages,3 figures"},{"id":"http://arxiv.org/abs/2401.12568v1","updated":"2024-01-23T08:54:10Z","published":"2024-01-23T08:54:10Z","title":"NeRF-AD: Neural Radiance Field with Attention-based Disentanglement for\n Talking Face Synthesis","summary":" Talking face synthesis driven by audio is one of the current research\nhotspots in the fields of multidimensional signal processing and multimedia.\nNeural Radiance Field (NeRF) has recently been brought to this research field\nin order to enhance the realism and 3D effect of the generated faces. However,\nmost existing NeRF-based methods either burden NeRF with complex learning tasks\nwhile lacking methods for supervised multimodal feature fusion, or cannot\nprecisely map audio to the facial region related to speech movements. These\nreasons ultimately result in existing methods generating inaccurate lip shapes.\nThis paper moves a portion of NeRF learning tasks ahead and proposes a talking\nface synthesis method via NeRF with attention-based disentanglement (NeRF-AD).\nIn particular, an Attention-based Disentanglement module is introduced to\ndisentangle the face into Audio-face and Identity-face using speech-related\nfacial action unit (AU) information. To precisely regulate how audio affects\nthe talking face, we only fuse the Audio-face with audio feature. In addition,\nAU information is also utilized to supervise the fusion of these two\nmodalities. Extensive qualitative and quantitative experiments demonstrate that\nour NeRF-AD outperforms state-of-the-art methods in generating realistic\ntalking face videos, including image quality and lip synchronization. To view\nvideo results, please refer to https://xiaoxingliu02.github.io/NeRF-AD.\n","authors":["Chongke Bi","Xiaoxing Liu","Zhilei Liu"],"pdf_url":"https://arxiv.org/pdf/2401.12568v1.pdf","comment":"Accepted by ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.12561v1","updated":"2024-01-23T08:44:26Z","published":"2024-01-23T08:44:26Z","title":"EndoGaussian: Gaussian Splatting for Deformable Surgical Scene\n Reconstruction","summary":" Reconstructing deformable tissues from endoscopic stereo videos is essential\nin many downstream surgical applications. However, existing methods suffer from\nslow inference speed, which greatly limits their practical use. In this paper,\nwe introduce EndoGaussian, a real-time surgical scene reconstruction framework\nthat builds on 3D Gaussian Splatting. Our framework represents dynamic surgical\nscenes as canonical Gaussians and a time-dependent deformation field, which\npredicts Gaussian deformations at novel timestamps. Due to the efficient\nGaussian representation and parallel rendering pipeline, our framework\nsignificantly accelerates the rendering speed compared to previous methods. 
In\naddition, we design the deformation field as the combination of a lightweight\nencoding voxel and an extremely tiny MLP, allowing for efficient Gaussian\ntracking with a minor rendering burden. Furthermore, we design a holistic\nGaussian initialization method to fully leverage the surface distribution\nprior, achieved by searching informative points from across the input image\nsequence. Experiments on public endoscope datasets demonstrate that our method\ncan achieve real-time rendering speed (195 FPS real-time, 100$\\times$ gain)\nwhile maintaining the state-of-the-art reconstruction quality (35.925 PSNR) and\nthe fastest training speed (within 2 min/scene), showing significant promise\nfor intraoperative surgery applications. Code is available at:\n\\url{https://yifliu3.github.io/EndoGaussian/}.\n","authors":["Yifan Liu","Chenxin Li","Chen Yang","Yixuan Yuan"],"pdf_url":"https://arxiv.org/pdf/2401.12561v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2003.13648v2","updated":"2024-01-23T08:40:55Z","published":"2020-03-30T17:32:49Z","title":"Weakly-supervised land classification for coastal zone based on deep\n convolutional neural networks by incorporating dual-polarimetric\n characteristics into training dataset","summary":" In this work we explore the performance of DCNNs on semantic segmentation\nusing spaceborne polarimetric synthetic aperture radar (PolSAR) datasets. The\nsemantic segmentation task using PolSAR data can be categorized as weakly\nsupervised learning when the characteristics of SAR data and data annotating\nprocedures are factored in. Datasets are initially analyzed for selecting\nfeasible pre-training images. Then the differences between spaceborne and\nairborne datasets are examined in terms of spatial resolution and viewing\ngeometry. In this study we used two dual-polarimetric images acquired by\nTerraSAR-X DLR. A novel method to produce training dataset with more supervised\ninformation is developed. Specifically, a series of typical classified images\nas well as intensity images serve as training datasets. A field survey is\nconducted for an area of about 20 square kilometers to obtain a ground truth\ndataset used for accuracy evaluation. Several transfer learning strategies are\nmade for aforementioned training datasets which will be combined in a\npracticable order. Three DCNN models, including SegNet, U-Net, and LinkNet, are\nimplemented next.\n","authors":["Sheng Sun","Armando Marino","Wenze Shui","Zhongwen Hu"],"pdf_url":"https://arxiv.org/pdf/2003.13648v2.pdf","comment":"We are sorry we would like to improve it"},{"id":"http://arxiv.org/abs/2312.03408v2","updated":"2024-01-23T08:36:17Z","published":"2023-12-06T10:46:53Z","title":"Open-sourced Data Ecosystem in Autonomous Driving: the Present and\n Future","summary":" With the continuous maturation and application of autonomous driving\ntechnology, a systematic examination of open-source autonomous driving datasets\nbecomes instrumental in fostering the robust evolution of the industry\necosystem. Current autonomous driving datasets can broadly be categorized into\ntwo generations. The first-generation autonomous driving datasets are\ncharacterized by relatively simpler sensor modalities, smaller data scale, and\nis limited to perception-level tasks. KITTI, introduced in 2012, serves as a\nprominent representative of this initial wave. 
In contrast, the\nsecond-generation datasets exhibit heightened complexity in sensor modalities,\ngreater data scale and diversity, and an expansion of tasks from perception to\nencompass prediction and control. Leading examples of the second generation\ninclude nuScenes and Waymo, introduced around 2019. This comprehensive review,\nconducted in collaboration with esteemed colleagues from both academia and\nindustry, systematically assesses over seventy open-source autonomous driving\ndatasets from domestic and international sources. It offers insights into\nvarious aspects, such as the principles underlying the creation of high-quality\ndatasets, the pivotal role of data engine systems, and the utilization of\ngenerative foundation models to facilitate scalable data generation.\nFurthermore, this review undertakes an exhaustive analysis and discourse\nregarding the characteristics and data scales that future third-generation\nautonomous driving datasets should possess. It also delves into the scientific\nand technical challenges that warrant resolution. These endeavors are pivotal\nin advancing autonomous innovation and fostering technological enhancement in\ncritical domains. For further details, please refer to\nhttps://github.com/OpenDriveLab/DriveAGI.\n","authors":["Hongyang Li","Yang Li","Huijie Wang","Jia Zeng","Pinlong Cai","Huilin Xu","Dahua Lin","Junchi Yan","Feng Xu","Lu Xiong","Jingdong Wang","Futang Zhu","Kai Yan","Chunjing Xu","Tiancai Wang","Beipeng Mu","Shaoqing Ren","Zhihui Peng","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2312.03408v2.pdf","comment":"This article is a simplified English translation of corresponding\n Chinese article. Please refer to Chinese version for the complete content"},{"id":"http://arxiv.org/abs/2310.07189v2","updated":"2024-01-23T08:20:05Z","published":"2023-10-11T04:38:21Z","title":"SpikePoint: An Efficient Point-based Spiking Neural Network for Event\n Cameras Action Recognition","summary":" Event cameras are bio-inspired sensors that respond to local changes in light\nintensity and feature low latency, high energy efficiency, and high dynamic\nrange. Meanwhile, Spiking Neural Networks (SNNs) have gained significant\nattention due to their remarkable efficiency and fault tolerance. By\nsynergistically harnessing the energy efficiency inherent in event cameras and\nthe spike-based processing capabilities of SNNs, their integration could enable\nultra-low-power application scenarios, such as action recognition tasks.\nHowever, existing approaches often entail converting asynchronous events into\nconventional frames, leading to additional data mapping efforts and a loss of\nsparsity, contradicting the design concept of SNNs and event cameras. To\naddress this challenge, we propose SpikePoint, a novel end-to-end point-based\nSNN architecture. SpikePoint excels at processing sparse event cloud data,\neffectively extracting both global and local features through a singular-stage\nstructure. Leveraging the surrogate training method, SpikePoint achieves high\naccuracy with few parameters and maintains low power consumption, specifically\nemploying the identity mapping feature extractor on diverse datasets.\nSpikePoint achieves state-of-the-art (SOTA) performance on four event-based\naction recognition datasets using only 16 timesteps, surpassing other SNN\nmethods. 
Moreover, it also achieves SOTA performance across all methods on\nthree datasets, utilizing approximately 0.3\\% of the parameters and 0.5\\% of\npower consumption employed by artificial neural networks (ANNs). These results\nemphasize the significance of Point Cloud and pave the way for many\nultra-low-power event-based data processing applications.\n","authors":["Hongwei Ren","Yue Zhou","Yulong Huang","Haotian Fu","Xiaopeng Lin","Jie Song","Bojun Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.07189v2.pdf","comment":"Accepted by ICLR 2024 (Spotlight)"},{"id":"http://arxiv.org/abs/2305.13208v2","updated":"2024-01-23T08:18:27Z","published":"2023-05-16T06:19:03Z","title":"Iterative Adversarial Attack on Image-guided Story Ending Generation","summary":" Multimodal learning involves developing models that can integrate information\nfrom various sources like images and texts. In this field, multimodal text\ngeneration is a crucial aspect that involves processing data from multiple\nmodalities and outputting text. The image-guided story ending generation\n(IgSEG) is a particularly significant task, targeting on an understanding of\ncomplex relationships between text and image data with a complete story text\nending. Unfortunately, deep neural networks, which are the backbone of recent\nIgSEG models, are vulnerable to adversarial samples. Current adversarial attack\nmethods mainly focus on single-modality data and do not analyze adversarial\nattacks for multimodal text generation tasks that use cross-modal information.\nTo this end, we propose an iterative adversarial attack method\n(Iterative-attack) that fuses image and text modality attacks, allowing for an\nattack search for adversarial text and image in an more effective iterative\nway. Experimental results demonstrate that the proposed method outperforms\nexisting single-modal and non-iterative multimodal attack methods, indicating\nthe potential for improving the adversarial robustness of multimodal text\ngeneration models, such as multimodal machine translation, multimodal question\nanswering, etc.\n","authors":["Youze Wang","Wenbo Hu","Richang Hong"],"pdf_url":"https://arxiv.org/pdf/2305.13208v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.13014v4","updated":"2024-01-23T08:16:09Z","published":"2023-04-25T17:38:41Z","title":"Methods and datasets for segmentation of minimally invasive surgical\n instruments in endoscopic images and videos: A review of the state of the art","summary":" In the field of computer- and robot-assisted minimally invasive surgery,\nenormous progress has been made in recent years based on the recognition of\nsurgical instruments in endoscopic images and videos. In particular, the\ndetermination of the position and type of instruments is of great interest.\nCurrent work involves both spatial and temporal information, with the idea that\npredicting the movement of surgical tools over time may improve the quality of\nthe final segmentations. The provision of publicly available datasets has\nrecently encouraged the development of new methods, mainly based on deep\nlearning. In this review, we identify and characterize datasets used for method\ndevelopment and evaluation and quantify their frequency of use in the\nliterature. We further present an overview of the current state of research\nregarding the segmentation and tracking of minimally invasive surgical\ninstruments in endoscopic images and videos. 
The paper focuses on methods that\nwork purely visually, without markers of any kind attached to the instruments,\nconsidering both single-frame semantic and instance segmentation approaches, as\nwell as those that incorporate temporal information. The publications analyzed\nwere identified through the platforms Google Scholar, Web of Science, and\nPubMed. The search terms used were \"instrument segmentation\", \"instrument\ntracking\", \"surgical tool segmentation\", and \"surgical tool tracking\",\nresulting in a total of 741 articles published between 01/2015 and 07/2023, of\nwhich 123 were included using systematic selection criteria. A discussion of\nthe reviewed literature is provided, highlighting existing shortcomings and\nemphasizing the available potential for future developments.\n","authors":["Tobias Rueckert","Daniel Rueckert","Christoph Palm"],"pdf_url":"https://arxiv.org/pdf/2304.13014v4.pdf","comment":"30 pages, 10 figures"},{"id":"http://arxiv.org/abs/2110.11334v3","updated":"2024-01-23T07:36:33Z","published":"2021-10-21T17:59:41Z","title":"Generalized Out-of-Distribution Detection: A Survey","summary":" Out-of-distribution (OOD) detection is critical to ensuring the reliability\nand safety of machine learning systems. For instance, in autonomous driving, we\nwould like the driving system to issue an alert and hand over the control to\nhumans when it detects unusual scenes or objects that it has never seen during\ntraining time and cannot make a safe decision. The term, OOD detection, first\nemerged in 2017 and since then has received increasing attention from the\nresearch community, leading to a plethora of methods developed, ranging from\nclassification-based to density-based to distance-based ones. Meanwhile,\nseveral other problems, including anomaly detection (AD), novelty detection\n(ND), open set recognition (OSR), and outlier detection (OD), are closely\nrelated to OOD detection in terms of motivation and methodology. Despite common\ngoals, these topics develop in isolation, and their subtle differences in\ndefinition and problem setting often confuse readers and practitioners. In this\nsurvey, we first present a unified framework called generalized OOD detection,\nwhich encompasses the five aforementioned problems, i.e., AD, ND, OSR, OOD\ndetection, and OD. Under our framework, these five problems can be seen as\nspecial cases or sub-tasks, and are easier to distinguish. We then review each\nof these five areas by summarizing their recent technical developments, with a\nspecial focus on OOD detection methodologies. We conclude this survey with open\nchallenges and potential research directions.\n","authors":["Jingkang Yang","Kaiyang Zhou","Yixuan Li","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2110.11334v3.pdf","comment":"Feel free to comment on our Overleaf manuscript:\n https://www.overleaf.com/9899719915wmccvdtwpkct#c25192"},{"id":"http://arxiv.org/abs/2401.12535v1","updated":"2024-01-23T07:24:16Z","published":"2024-01-23T07:24:16Z","title":"Self-Supervised Vision Transformers Are Efficient Segmentation Learners\n for Imperfect Labels","summary":" This study demonstrates a cost-effective approach to semantic segmentation\nusing self-supervised vision transformers (SSVT). 
By freezing the SSVT backbone\nand training a lightweight segmentation head, our approach effectively utilizes\nimperfect labels, thereby improving robustness to label imperfections.\nEmpirical experiments show significant performance improvements over existing\nmethods for various annotation types, including scribble, point-level, and\nimage-level labels. The research highlights the effectiveness of\nself-supervised vision transformers in dealing with imperfect labels, providing\na practical and efficient solution for semantic segmentation while reducing\nannotation costs. Through extensive experiments, we confirm that our method\noutperforms baseline models for all types of imperfect labels. Especially under\nthe zero-shot vision-language-model-based label, our model exhibits 11.5\\%p\nperformance gain compared to the baseline.\n","authors":["Seungho Lee","Seoungyoon Kang","Hyunjung Shim"],"pdf_url":"https://arxiv.org/pdf/2401.12535v1.pdf","comment":"AAAI2024 Edge Intelligence Workshop (EIW) accepted"},{"id":"http://arxiv.org/abs/2401.12513v1","updated":"2024-01-23T06:08:00Z","published":"2024-01-23T06:08:00Z","title":"Detecting and recognizing characters in Greek papyri with YOLOv8, DeiT\n and SimCLR","summary":" The capacity to isolate and recognize individual characters from facsimile\nimages of papyrus manuscripts yields rich opportunities for digital analysis.\nFor this reason the `ICDAR 2023 Competition on Detection and Recognition of\nGreek Letters on Papyri' was held as part of the 17th International Conference\non Document Analysis and Recognition. This paper discusses our submission to\nthe competition. We used an ensemble of YOLOv8 models to detect and classify\nindividual characters and employed two different approaches for refining the\ncharacter predictions, including a transformer based DeiT approach and a\nResNet-50 model trained on a large corpus of unlabelled data using SimCLR, a\nself-supervised learning method. Our submission won the recognition challenge\nwith a mAP of 42.2%, and was runner-up in the detection challenge with a mean\naverage precision (mAP) of 51.4%. At the more relaxed intersection over union\nthreshold of 0.5, we achieved the highest mean average precision and mean\naverage recall results for both detection and classification. We ran our\nprediction pipeline on more than 4,500 images from the Oxyrhynchus Papyri to\nillustrate the utility of our approach, and we release the results publicly in\nmultiple formats.\n","authors":["Robert Turnbull","Evelyn Mannix"],"pdf_url":"https://arxiv.org/pdf/2401.12513v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2202.03087v3","updated":"2024-01-23T06:07:45Z","published":"2022-02-07T11:55:23Z","title":"Unsupervised Long-Term Person Re-Identification with Clothes Change","summary":" We investigate unsupervised person re-identification (Re-ID) with clothes\nchange, a new challenging problem with more practical usability and scalability\nto real-world deployment. Most existing re-id methods artificially assume the\nclothes of every single person to be stationary across space and time. This\ncondition is mostly valid for short-term re-id scenarios since an average\nperson would often change the clothes even within a single day. To alleviate\nthis assumption, several recent works have introduced the clothes change facet\nto re-id, with a focus on supervised learning person identity discriminative\nrepresentation with invariance to clothes changes. 
Taking a step further\ntowards this long-term re-id direction, we further eliminate the requirement of\nperson identity labels, as they are significantly more expensive and more\ntedious to annotate in comparison to short-term person re-id datasets. Compared\nto conventional unsupervised short-term re-id, this new problem is drastically\nmore challenging as different people may have similar clothes whilst the same\nperson can wear multiple suites of clothes over different locations and times\nwith very distinct appearance. To overcome such obstacles, we introduce a novel\nCurriculum Person Clustering (CPC) method that can adaptively regulate the\nunsupervised clustering criterion according to the clustering confidence.\nExperiments on three long-term person re-id datasets show that our CPC\noutperforms SOTA unsupervised re-id methods and even closely matches the\nsupervised re-id models.\n","authors":["Mingkun Li","Shupeng Cheng","Peng Xu","Xiatian Zhu","Chun-Guang Li","Jun Guo"],"pdf_url":"https://arxiv.org/pdf/2202.03087v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12511v1","updated":"2024-01-23T06:03:16Z","published":"2024-01-23T06:03:16Z","title":"Convolutional Initialization for Data-Efficient Vision Transformers","summary":" Training vision transformer networks on small datasets poses challenges. In\ncontrast, convolutional neural networks (CNNs) can achieve state-of-the-art\nperformance by leveraging their architectural inductive bias. In this paper, we\ninvestigate whether this inductive bias can be reinterpreted as an\ninitialization bias within a vision transformer network. Our approach is\nmotivated by the finding that random impulse filters can achieve almost\ncomparable performance to learned filters in CNNs. We introduce a novel\ninitialization strategy for transformer networks that can achieve comparable\nperformance to CNNs on small datasets while preserving its architectural\nflexibility.\n","authors":["Jianqiao Zheng","Xueqian Li","Simon Lucey"],"pdf_url":"https://arxiv.org/pdf/2401.12511v1.pdf","comment":"14 pages, 9 figures, 8 tables"},{"id":"http://arxiv.org/abs/2401.12507v1","updated":"2024-01-23T05:57:50Z","published":"2024-01-23T05:57:50Z","title":"Open-Set Facial Expression Recognition","summary":" Facial expression recognition (FER) models are typically trained on datasets\nwith a fixed number of seven basic classes. However, recent research works\npoint out that there are far more expressions than the basic ones. Thus, when\nthese models are deployed in the real world, they may encounter unknown\nclasses, such as compound expressions that cannot be classified into existing\nbasic classes. To address this issue, we propose the open-set FER task for the\nfirst time. Though there are many existing open-set recognition methods, we\nargue that they do not work well for open-set FER because FER data are all\nhuman faces with very small inter-class distances, which makes the open-set\nsamples very similar to close-set samples. In this paper, we are the first to\ntransform the disadvantage of small inter-class distance into an advantage by\nproposing a new way for open-set FER. Specifically, we find that small\ninter-class distance allows for sparsely distributed pseudo labels of open-set\nsamples, which can be viewed as symmetric noisy labels. Based on this novel\nobservation, we convert the open-set FER to a noisy label detection problem. 
We\nfurther propose a novel method that incorporates attention map consistency and\ncycle training to detect the open-set samples. Extensive experiments on various\nFER datasets demonstrate that our method clearly outperforms state-of-the-art\nopen-set recognition methods by large margins. Code is available at\nhttps://github.com/zyh-uaiaaaa.\n","authors":["Yuhang Zhang","Yue Yao","Xuannan Liu","Lixiong Qin","Wenjing Wang","Weihong Deng"],"pdf_url":"https://arxiv.org/pdf/2401.12507v1.pdf","comment":"Accepted by AAAI2024"},{"id":"http://arxiv.org/abs/2401.03179v2","updated":"2024-01-23T05:57:30Z","published":"2024-01-06T09:53:33Z","title":"Multimodal Informative ViT: Information Aggregation and Distribution for\n Hyperspectral and LiDAR Classification","summary":" In multimodal land cover classification (MLCC), a common challenge is the\nredundancy in data distribution, where irrelevant information from multiple\nmodalities can hinder the effective integration of their unique features. To\ntackle this, we introduce the Multimodal Informative Vit (MIVit), a system with\nan innovative information aggregate-distributing mechanism. This approach\nredefines redundancy levels and integrates performance-aware elements into the\nfused representation, facilitating the learning of semantics in both forward\nand backward directions. MIVit stands out by significantly reducing redundancy\nin the empirical distribution of each modality's separate and fused features.\nIt employs oriented attention fusion (OAF) for extracting shallow local\nfeatures across modalities in horizontal and vertical dimensions, and a\nTransformer feature extractor for extracting deep global features through\nlong-range attention. We also propose an information aggregation constraint\n(IAC) based on mutual information, designed to remove redundant information and\npreserve complementary information within embedded features. Additionally, the\ninformation distribution flow (IDF) in MIVit enhances performance-awareness by\ndistributing global classification information across different modalities'\nfeature maps. This architecture also addresses missing modality challenges with\nlightweight independent modality classifiers, reducing the computational load\ntypically associated with Transformers. Our results show that MIVit's\nbidirectional aggregate-distributing mechanism between modalities is highly\neffective, achieving an average overall accuracy of 95.56% across three\nmultimodal datasets. This performance surpasses current state-of-the-art\nmethods in MLCC. The code for MIVit is accessible at\nhttps://github.com/icey-zhang/MIViT.\n","authors":["Jiaqing Zhang","Jie Lei","Weiying Xie","Geng Yang","Daixun Li","Yunsong Li"],"pdf_url":"https://arxiv.org/pdf/2401.03179v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12503v1","updated":"2024-01-23T05:55:26Z","published":"2024-01-23T05:55:26Z","title":"Small Language Model Meets with Reinforced Vision Vocabulary","summary":" Playing Large Vision Language Models (LVLMs) in 2023 is trendy among the AI\ncommunity. However, the relatively large number of parameters (more than 7B) of\npopular LVLMs makes it difficult to train and deploy on consumer GPUs,\ndiscouraging many researchers with limited resources. Imagine how cool it would\nbe to experience all the features of current LVLMs on an old GTX1080ti (our\nonly game card). Accordingly, we present Vary-toy in this report, a small-size\nVary along with Qwen-1.8B as the base ``large'' language model. 
In Vary-toy, we\nintroduce an improved vision vocabulary, allowing the model to not only possess\nall features of Vary but also gather more generality. Specifically, we replace\nnegative samples of natural images with positive sample data driven by object\ndetection in the procedure of generating vision vocabulary, more sufficiently\nutilizing the capacity of the vocabulary network and enabling it to efficiently\nencode visual information corresponding to natural objects. For experiments,\nVary-toy can achieve 65.6% ANLS on DocVQA, 59.1% accuracy on ChartQA, 88.1%\naccuracy on RefCOCO, and 29% on MMVet. The code will be publicly available on\nthe homepage.\n","authors":["Haoran Wei","Lingyu Kong","Jinyue Chen","Liang Zhao","Zheng Ge","En Yu","Jianjian Sun","Chunrui Han","Xiangyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.12503v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.00334v3","updated":"2024-01-23T05:38:56Z","published":"2023-12-30T21:48:20Z","title":"Explainability-Driven Leaf Disease Classification Using Adversarial\n Training and Knowledge Distillation","summary":" This work focuses on plant leaf disease classification and explores three\ncrucial aspects: adversarial training, model explainability, and model\ncompression. The models' robustness against adversarial attacks is enhanced\nthrough adversarial training, ensuring accurate classification even in the\npresence of threats. Leveraging explainability techniques, we gain insights\ninto the model's decision-making process, improving trust and transparency.\nAdditionally, we explore model compression techniques to optimize computational\nefficiency while maintaining classification performance. Through our\nexperiments, we determine that on a benchmark dataset, the robustness can be\nthe price of the classification accuracy with performance reductions of 3%-20%\nfor regular tests and gains of 50%-70% for adversarial attack tests. We also\ndemonstrate that a student model can be 15-25 times more computationally\nefficient for a slight performance reduction, distilling the knowledge of more\ncomplex models.\n","authors":["Sebastian-Vasile Echim","Iulian-Marius Tăiatu","Dumitru-Clementin Cercel","Florin Pop"],"pdf_url":"https://arxiv.org/pdf/2401.00334v3.pdf","comment":"10 pages, 8 figures, Accepted by ICAART 2024"},{"id":"http://arxiv.org/abs/2304.06470v5","updated":"2024-01-23T05:03:53Z","published":"2023-03-29T15:26:44Z","title":"Qualitative Failures of Image Generation Models and Their Application in\n Detecting Deepfakes","summary":" The ability of image and video generation models to create photorealistic\nimages has reached unprecedented heights, making it difficult to distinguish\nbetween real and fake images in many cases. However, despite this progress, a\ngap remains between the quality of generated images and those found in the real\nworld. To address this, we have reviewed a vast body of literature from both\nacademic publications and social media to identify qualitative shortcomings in\nimage generation models, which we have classified into five categories. By\nunderstanding these failures, we can identify areas where these models need\nimprovement, as well as develop strategies for detecting deep fakes. 
The\nprevalence of deep fakes in today's society is a serious concern, and our\nfindings can help mitigate their negative impact.\n","authors":["Ali Borji"],"pdf_url":"https://arxiv.org/pdf/2304.06470v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12488v1","updated":"2024-01-23T05:00:02Z","published":"2024-01-23T05:00:02Z","title":"An Automated Real-Time Approach for Image Processing and Segmentation of\n Fluoroscopic Images and Videos Using a Single Deep Learning Network","summary":" Image segmentation in total knee arthroplasty is crucial for precise\npreoperative planning and accurate implant positioning, leading to improved\nsurgical outcomes and patient satisfaction. The biggest challenges of image\nsegmentation in total knee arthroplasty include accurately delineating complex\nanatomical structures, dealing with image artifacts and noise, and developing\nrobust algorithms that can handle anatomical variations and pathologies\ncommonly encountered in patients. The potential of using machine learning for\nimage segmentation in total knee arthroplasty lies in its ability to improve\nsegmentation accuracy, automate the process, and provide real-time assistance\nto surgeons, leading to enhanced surgical planning, implant placement, and\npatient outcomes. This paper proposes a methodology to use deep learning for\nrobust and real-time total knee arthroplasty image segmentation. The deep\nlearning model, trained on a large dataset, demonstrates outstanding\nperformance in accurately segmenting both the implanted femur and tibia,\nachieving an impressive mean-Average-Precision (mAP) of 88.83 when compared to\nthe ground truth while also achieving a real-time segmented speed of 20 frames\nper second (fps). We have introduced a novel methodology for segmenting\nimplanted knee fluoroscopic or x-ray images that showcases remarkable levels of\naccuracy and speed, paving the way for various potential extended applications.\n","authors":["Viet Dung Nguyen","Michael T. LaCour","Richard D. Komistek"],"pdf_url":"https://arxiv.org/pdf/2401.12488v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11115v2","updated":"2024-01-23T04:41:12Z","published":"2024-01-20T04:58:06Z","title":"MotionMix: Weakly-Supervised Diffusion for Controllable Motion\n Generation","summary":" Controllable generation of 3D human motions becomes an important topic as the\nworld embraces digital transformation. Existing works, though making promising\nprogress with the advent of diffusion models, heavily rely on meticulously\ncaptured and annotated (e.g., text) high-quality motion corpus, a\nresource-intensive endeavor in the real world. This motivates our proposed\nMotionMix, a simple yet effective weakly-supervised diffusion model that\nleverages both noisy and unannotated motion sequences. Specifically, we\nseparate the denoising objectives of a diffusion model into two stages:\nobtaining conditional rough motion approximations in the initial $T-T^*$ steps\nby learning the noisy annotated motions, followed by the unconditional\nrefinement of these preliminary motions during the last $T^*$ steps using\nunannotated motions. Notably, though learning from two sources of imperfect\ndata, our model does not compromise motion generation quality compared to fully\nsupervised approaches that access gold data. 
Extensive experiments on several\nbenchmarks demonstrate that our MotionMix, as a versatile framework,\nconsistently achieves state-of-the-art performances on text-to-motion,\naction-to-motion, and music-to-dance tasks.\n","authors":["Nhat M. Hoang","Kehong Gong","Chuan Guo","Michael Bi Mi"],"pdf_url":"https://arxiv.org/pdf/2401.11115v2.pdf","comment":"Accepted at the 38th Association for the Advancement of Artificial\n Intelligence (AAAI) Conference on Artificial Intelligence, Main Conference"},{"id":"http://arxiv.org/abs/2401.12480v1","updated":"2024-01-23T04:19:15Z","published":"2024-01-23T04:19:15Z","title":"Explore Synergistic Interaction Across Frames for Interactive Video\n Object Segmentation","summary":" Interactive Video Object Segmentation (iVOS) is a challenging task that\nrequires real-time human-computer interaction. To improve the user experience,\nit is important to consider the user's input habits, segmentation quality,\nrunning time and memory consumption.However, existing methods compromise user\nexperience with single input mode and slow running speed. Specifically, these\nmethods only allow the user to interact with one single frame, which limits the\nexpression of the user's intent.To overcome these limitations and better align\nwith people's usage habits, we propose a framework that can accept multiple\nframes simultaneously and explore synergistic interaction across frames (SIAF).\nConcretely, we designed the Across-Frame Interaction Module that enables users\nto annotate different objects freely on multiple frames. The AFI module will\nmigrate scribble information among multiple interactive frames and generate\nmulti-frame masks. Additionally, we employ the id-queried mechanism to process\nmultiple objects in batches. Furthermore, for a more efficient propagation and\nlightweight model, we design a truncated re-propagation strategy to replace the\nprevious multi-round fusion module, which employs an across-round memory that\nstores important interaction information. Our SwinB-SIAF achieves new\nstate-of-the-art performance on DAVIS 2017 (89.6%, J&F@60). Moreover, our\nR50-SIAF is more than 3 faster than the state-of-the-art competitor under\nchallenging multi-object scenarios.\n","authors":["Kexin Li","Tao Jiang","Zongxin Yang","Yi Yang","Yueting Zhuang","Jun Xiao"],"pdf_url":"https://arxiv.org/pdf/2401.12480v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12479v1","updated":"2024-01-23T04:17:42Z","published":"2024-01-23T04:17:42Z","title":"TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph\n Generation","summary":" Dynamic scene graph generation (SGG) focuses on detecting objects in a video\nand determining their pairwise relationships. Existing dynamic SGG methods\nusually suffer from several issues, including 1) Contextual noise, as some\nframes might contain occluded and blurred objects. 2) Label bias, primarily due\nto the high imbalance between a few positive relationship samples and numerous\nnegative ones. Additionally, the distribution of relationships exhibits a\nlong-tailed pattern. To address the above problems, in this paper, we introduce\na network named TD$^2$-Net that aims at denoising and debiasing for dynamic\nSGG. 
Specifically, we first propose a denoising spatio-temporal transformer\nmodule that enhances object representation with robust contextual information.\nThis is achieved by designing a differentiable Top-K object selector that\nutilizes the gumbel-softmax sampling strategy to select the relevant\nneighborhood for each object. Second, we introduce an asymmetrical reweighting\nloss to relieve the issue of label bias. This loss function integrates\nasymmetry focusing factors and the volume of samples to adjust the weights\nassigned to individual samples. Systematic experimental results demonstrate the\nsuperiority of our proposed TD$^2$-Net over existing state-of-the-art\napproaches on Action Genome databases. In more detail, TD$^2$-Net outperforms\nthe second-best competitors by 12.7 \\% on mean-Recall@10 for predicate\nclassification.\n","authors":["Xin Lin","Chong Shi","Yibing Zhan","Zuopeng Yang","Yaqi Wu","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2401.12479v1.pdf","comment":"Accepted by AAAI 2024"},{"id":"http://arxiv.org/abs/2305.14800v6","updated":"2024-01-23T04:01:43Z","published":"2023-05-24T06:52:47Z","title":"Exploring Diverse In-Context Configurations for Image Captioning","summary":" After discovering that Language Models (LMs) can be good in-context few-shot\nlearners, numerous strategies have been proposed to optimize in-context\nsequence configurations. Recently, researchers in Vision-Language (VL) domains\nalso develop their few-shot learners, while they only use the simplest way,\nie., randomly sampling, to configure in-context image-text pairs. In order to\nexplore the effects of varying configurations on VL in-context learning, we\ndevised four strategies for image selection and four for caption assignment to\nconfigure in-context image-text pairs for image captioning. Here Image\nCaptioning is used as the case study since it can be seen as the\nvisually-conditioned LM. Our comprehensive experiments yield two\ncounter-intuitive but valuable insights, highlighting the distinct\ncharacteristics of VL in-context learning due to multi-modal synergy, as\ncompared to the NLP case. Furthermore, in our exploration of optimal\ncombination strategies, we observed an average performance enhancement of 20.9\nof CIDEr scores compared to the baseline. The code is given in\nhttps://github.com/yongliang-wu/ExploreCfg.\n","authors":["Xu Yang","Yongliang Wu","Mingzhuo Yang","Haokun Chen","Xin Geng"],"pdf_url":"https://arxiv.org/pdf/2305.14800v6.pdf","comment":"Accepted by NeurIPS2023"},{"id":"http://arxiv.org/abs/2301.11915v2","updated":"2024-01-23T04:00:25Z","published":"2023-01-27T18:58:42Z","title":"Understanding Self-Supervised Pretraining with Part-Aware Representation\n Learning","summary":" In this paper, we are interested in understanding self-supervised pretraining\nthrough studying the capability that self-supervised representation pretraining\nmethods learn part-aware representations. The study is mainly motivated by that\nrandom views, used in contrastive learning, and random masked (visible)\npatches, used in masked image modeling, are often about object parts.\n We explain that contrastive learning is a part-to-whole task: the projection\nlayer hallucinates the whole object representation from the object part\nrepresentation learned from the encoder, and that masked image modeling is a\npart-to-part task: the masked patches of the object are hallucinated from the\nvisible patches. 
The explanation suggests that the self-supervised pretrained\nencoder is required to understand the object part. We empirically compare the\noff-the-shelf encoders pretrained with several representative methods on\nobject-level recognition and part-level recognition. The results show that the\nfully-supervised model outperforms self-supervised models for object-level\nrecognition, and most self-supervised contrastive learning and masked image\nmodeling methods outperform the fully-supervised method for part-level\nrecognition. It is observed that the combination of contrastive learning and\nmasked image modeling further improves the performance.\n","authors":["Jie Zhu","Jiyang Qi","Mingyu Ding","Xiaokang Chen","Ping Luo","Xinggang Wang","Wenyu Liu","Leye Wang","Jingdong Wang"],"pdf_url":"https://arxiv.org/pdf/2301.11915v2.pdf","comment":"Accepted by TMLR"},{"id":"http://arxiv.org/abs/2203.13883v4","updated":"2024-01-23T03:54:48Z","published":"2022-03-25T19:45:33Z","title":"Multi-modal Misinformation Detection: Approaches, Challenges and\n Opportunities","summary":" As social media platforms are evolving from text-based forums into\nmulti-modal environments, the nature of misinformation in social media is also\ntransforming accordingly. Taking advantage of the fact that visual modalities\nsuch as images and videos are more favorable and attractive to the users and\ntextual contents are sometimes skimmed carelessly, misinformation spreaders\nhave recently targeted contextual connections between the modalities e.g., text\nand image. Hence many researchers have developed automatic techniques for\ndetecting possible cross-modal discordance in web-based content. We analyze,\ncategorize and identify existing approaches in addition to challenges and\nshortcomings they face in order to unearth new research opportunities in the\nfield of multi-modal misinformation detection.\n","authors":["Sara Abdali"],"pdf_url":"https://arxiv.org/pdf/2203.13883v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12471v1","updated":"2024-01-23T03:45:05Z","published":"2024-01-23T03:45:05Z","title":"Zero Shot Open-ended Video Inference","summary":" Zero-shot open-ended inference on untrimmed videos poses a significant\nchallenge, especially when no annotated data is utilized to navigate the\ninference direction. In this work, we aim to address this underexplored domain\nby introducing an adaptable framework that efficiently combines both the frozen\nvision-language (VL) model and off-the-shelf large language model (LLM) for\nconducting zero-shot open-ended inference tasks without requiring any\nadditional training or fine-tuning. Our comprehensive experiments span various\nvideo action datasets for goal inference and action recognition tasks. The\nresults demonstrate the framework's superior performance in goal inference\ncompared to conventional vision-language models in open-ended and close-ended\nscenarios. 
Notably, the proposed framework exhibits the capability to\ngeneralize effectively to action recognition tasks, underscoring its\nversatility and potential contributions to advancing the video-based zero-shot\nunderstanding.\n","authors":["Ee Yeo Keat","Zhang Hao","Alexander Matyasko","Basura Fernando"],"pdf_url":"https://arxiv.org/pdf/2401.12471v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.14451v2","updated":"2024-01-23T03:41:44Z","published":"2023-06-26T06:45:16Z","title":"Learning Prompt-Enhanced Context Features for Weakly-Supervised Video\n Anomaly Detection","summary":" Video anomaly detection under weak supervision presents significant\nchallenges, particularly due to the lack of frame-level annotations during\ntraining. While prior research has utilized graph convolution networks and\nself-attention mechanisms alongside multiple instance learning (MIL)-based\nclassification loss to model temporal relations and learn discriminative\nfeatures, these methods often employ multi-branch architectures to capture\nlocal and global dependencies separately, resulting in increased parameters and\ncomputational costs. Moreover, the coarse-grained interclass separability\nprovided by the binary constraint of MIL-based loss neglects the fine-grained\ndiscriminability within anomalous classes. In response, this paper introduces a\nweakly supervised anomaly detection framework that focuses on efficient context\nmodeling and enhanced semantic discriminability. We present a Temporal Context\nAggregation (TCA) module that captures comprehensive contextual information by\nreusing the similarity matrix and implementing adaptive fusion. Additionally,\nwe propose a Prompt-Enhanced Learning (PEL) module that integrates semantic\npriors using knowledge-based prompts to boost the discriminative capacity of\ncontext features while ensuring separability between anomaly sub-classes.\nExtensive experiments validate the effectiveness of our method's components,\ndemonstrating competitive performance with reduced parameters and computational\neffort on three challenging benchmarks: UCF-Crime, XD-Violence, and\nShanghaiTech datasets. Notably, our approach significantly improves the\ndetection accuracy of certain anomaly sub-classes, underscoring its practical\nvalue and efficacy. Our code is available at:\nhttps://github.com/yujiangpu20/PEL4VAD.\n","authors":["Yujiang Pu","Xiaoyu Wu","Lulu Yang","Shengjin Wang"],"pdf_url":"https://arxiv.org/pdf/2306.14451v2.pdf","comment":"13 pages, 9 figures"},{"id":"http://arxiv.org/abs/2401.12001v2","updated":"2024-01-23T03:19:12Z","published":"2024-01-22T14:52:08Z","title":"Modeling Stereo-Confidence Out of the End-to-End Stereo-Matching Network\n via Disparity Plane Sweep","summary":" We propose a novel stereo-confidence that can be measured externally to\nvarious stereo-matching networks, offering an alternative input modality choice\nof the cost volume for learning-based approaches, especially in safety-critical\nsystems. Grounded in the foundational concepts of disparity definition and the\ndisparity plane sweep, the proposed stereo-confidence method is built upon the\nidea that any shift in a stereo-image pair should be updated in a corresponding\namount shift in the disparity map. Based on this idea, the proposed\nstereo-confidence method can be summarized in three folds. 
1) Using the\ndisparity plane sweep, multiple disparity maps can be obtained and treated as a\n3-D volume (predicted disparity volume), like the cost volume is constructed.\n2) One of these disparity maps serves as an anchor, allowing us to define a\ndesirable (or ideal) disparity profile at every spatial point. 3) By comparing\nthe desirable and predicted disparity profiles, we can quantify the level of\nmatching ambiguity between left and right images for confidence measurement.\nExtensive experimental results using various stereo-matching networks and\ndatasets demonstrate that the proposed stereo-confidence method not only shows\ncompetitive performance on its own but also consistent performance improvements\nwhen it is used as an input modality for learning-based stereo-confidence\nmethods.\n","authors":["Jae Young Lee","Woonghyun Ka","Jaehyun Choi","Junmo Kim"],"pdf_url":"https://arxiv.org/pdf/2401.12001v2.pdf","comment":"AAAI 2024. The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2401.12019v2","updated":"2024-01-23T03:16:43Z","published":"2024-01-22T15:05:05Z","title":"Stereo-Matching Knowledge Distilled Monocular Depth Estimation Filtered\n by Multiple Disparity Consistency","summary":" In stereo-matching knowledge distillation methods of the self-supervised\nmonocular depth estimation, the stereo-matching network's knowledge is\ndistilled into a monocular depth network through pseudo-depth maps. In these\nmethods, the learning-based stereo-confidence network is generally utilized to\nidentify errors in the pseudo-depth maps to prevent transferring the errors.\nHowever, the learning-based stereo-confidence networks should be trained with\nground truth (GT), which is not feasible in a self-supervised setting. In this\npaper, we propose a method to identify and filter errors in the pseudo-depth\nmap using multiple disparity maps by checking their consistency without the\nneed for GT and a training process. Experimental results show that the proposed\nmethod outperforms the previous methods and works well on various\nconfigurations by filtering out erroneous areas where the stereo-matching is\nvulnerable, especially such as textureless regions, occlusion boundaries, and\nreflective surfaces.\n","authors":["Woonghyun Ka","Jae Young Lee","Jaehyun Choi","Junmo Kim"],"pdf_url":"https://arxiv.org/pdf/2401.12019v2.pdf","comment":"ICASSP 2024. The first two authors are equally contributed"},{"id":"http://arxiv.org/abs/2401.09495v4","updated":"2024-01-23T03:09:53Z","published":"2024-01-17T01:33:40Z","title":"IPR-NeRF: Ownership Verification meets Neural Radiance Field","summary":" Neural Radiance Field (NeRF) models have gained significant attention in the\ncomputer vision community in the recent past with state-of-the-art visual\nquality and produced impressive demonstrations. Since then, technopreneurs have\nsought to leverage NeRF models into a profitable business. Therefore, NeRF\nmodels make it worth the risk of plagiarizers illegally copying,\nre-distributing, or misusing those models. This paper proposes a comprehensive\nintellectual property (IP) protection framework for the NeRF model in both\nblack-box and white-box settings, namely IPR-NeRF. In the black-box setting, a\ndiffusion-based solution is introduced to embed and extract the watermark via a\ntwo-stage optimization process. In the white-box setting, a designated digital\nsignature is embedded into the weights of the NeRF model by adopting the sign\nloss objective. 
Our extensive experiments demonstrate that not only does our\napproach maintain the fidelity (\\ie, the rendering quality) of IPR-NeRF models,\nbut it is also robust against both ambiguity and removal attacks compared to\nprior arts.\n","authors":["Win Kent Ong","Kam Woh Ng","Chee Seng Chan","Yi Zhe Song","Tao Xiang"],"pdf_url":"https://arxiv.org/pdf/2401.09495v4.pdf","comment":"Error on result tabulation of state of the art method which might\n cause misleading to readers"},{"id":"http://arxiv.org/abs/2309.17105v4","updated":"2024-01-23T02:59:35Z","published":"2023-09-29T10:06:28Z","title":"Continual Action Assessment via Task-Consistent Score-Discriminative\n Feature Distribution Modeling","summary":" Action Quality Assessment (AQA) is a task that tries to answer how well an\naction is carried out. While remarkable progress has been achieved, existing\nworks on AQA assume that all the training data are visible for training in one\ntime, but do not enable continual learning on assessing new technical actions.\nIn this work, we address such a Continual Learning problem in AQA\n(Continual-AQA), which urges a unified model to learn AQA tasks sequentially\nwithout forgetting. Our idea for modeling Continual-AQA is to sequentially\nlearn a task-consistent score-discriminative feature distribution, in which the\nlatent features express a strong correlation with the score labels regardless\nof the task or action types. From this perspective, we aim to mitigate the\nforgetting in Continual-AQA from two aspects. Firstly, to fuse the features of\nnew and previous data into a score-discriminative distribution, a novel\nFeature-Score Correlation-Aware Rehearsal is proposed to store and reuse data\nfrom previous tasks with limited memory size. Secondly, an Action\nGeneral-Specific Graph is developed to learn and decouple the action-general\nand action-specific knowledge so that the task-consistent score-discriminative\nfeatures can be better extracted across various tasks. Extensive experiments\nare conducted to evaluate the contributions of proposed components. The\ncomparisons with the existing continual learning methods additionally verify\nthe effectiveness and versatility of our approach.\n","authors":["Yuan-Ming Li","Ling-An Zeng","Jing-Ke Meng","Wei-Shi Zheng"],"pdf_url":"https://arxiv.org/pdf/2309.17105v4.pdf","comment":"13 pages, 7 figures"},{"id":"http://arxiv.org/abs/2401.01520v2","updated":"2024-01-23T02:59:04Z","published":"2024-01-03T03:08:32Z","title":"S$^{2}$-DMs:Skip-Step Diffusion Models","summary":" Diffusion models have emerged as powerful generative tools, rivaling GANs in\nsample quality and mirroring the likelihood scores of autoregressive models. A\nsubset of these models, exemplified by DDIMs, exhibit an inherent asymmetry:\nthey are trained over $T$ steps but only sample from a subset of $T$ during\ngeneration. This selective sampling approach, though optimized for speed,\ninadvertently misses out on vital information from the unsampled steps, leading\nto potential compromises in sample quality. To address this issue, we present\nthe S$^{2}$-DMs, which is a new training method by using an innovative\n$L_{skip}$, meticulously designed to reintegrate the information omitted during\nthe selective sampling phase. The benefits of this approach are manifold: it\nnotably enhances sample quality, is exceptionally simple to implement, requires\nminimal code modifications, and is flexible enough to be compatible with\nvarious sampling algorithms. 
On the CIFAR10 dataset, models trained using our\nalgorithm showed an improvement of 3.27% to 14.06% over models trained with\ntraditional methods across various sampling algorithms (DDIMs, PNDMs, DEIS) and\ndifferent numbers of sampling steps (10, 20, ..., 1000). On the CELEBA dataset,\nthe improvement ranged from 8.97% to 27.08%. Access to the code and additional\nresources is provided in the github.\n","authors":["Yixuan Wang","Shuangyin Li"],"pdf_url":"https://arxiv.org/pdf/2401.01520v2.pdf","comment":"12 pages"},{"id":"http://arxiv.org/abs/2401.12456v1","updated":"2024-01-23T02:53:06Z","published":"2024-01-23T02:53:06Z","title":"Exploration and Improvement of Nerf-based 3D Scene Editing Techniques","summary":" NeRF's high-quality scene synthesis capability was quickly accepted by\nscholars in the years after it was proposed, and significant progress has been\nmade in 3D scene representation and synthesis. However, the high computational\ncost limits intuitive and efficient editing of scenes, making NeRF's\ndevelopment in the scene editing field facing many challenges. This paper\nreviews the preliminary explorations of scholars on NeRF in the scene or object\nediting field in recent years, mainly changing the shape and texture of scenes\nor objects in new synthesized scenes; through the combination of residual\nmodels such as GaN and Transformer with NeRF, the generalization ability of\nNeRF scene editing has been further expanded, including realizing real-time new\nperspective editing feedback, multimodal editing of text synthesized 3D scenes,\n4D synthesis performance, and in-depth exploration in light and shadow editing,\ninitially achieving optimization of indirect touch editing and detail\nrepresentation in complex scenes. Currently, most NeRF editing methods focus on\nthe touch points and materials of indirect points, but when dealing with more\ncomplex or larger 3D scenes, it is difficult to balance accuracy, breadth,\nefficiency, and quality. Overcoming these challenges may become the direction\nof future NeRF 3D scene editing technology.\n","authors":["Shun Fang","Ming Cui","Xing Feng","Yanan Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.12456v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12452v1","updated":"2024-01-23T02:41:06Z","published":"2024-01-23T02:41:06Z","title":"Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural\n Calibration","summary":" This paper introduces a novel self-supervised learning framework for\nenhancing 3D perception in autonomous driving scenes. Specifically, our\napproach, named NCLR, focuses on 2D-3D neural calibration, a novel pretext task\nthat estimates the rigid transformation aligning camera and LiDAR coordinate\nsystems. First, we propose the learnable transformation alignment to bridge the\ndomain gap between image and point cloud data, converting features into a\nunified representation space for effective comparison and matching. Second, we\nidentify the overlapping area between the image and point cloud with the fused\nfeatures. Third, we establish dense 2D-3D correspondences to estimate the rigid\ntransformation. The framework not only learns fine-grained matching from points\nto pixels but also achieves alignment of the image and point cloud at a\nholistic level, understanding their relative pose. We demonstrate NCLR's\nefficacy by applying the pre-trained backbone to downstream tasks, such as\nLiDAR-based 3D semantic segmentation, object detection, and panoptic\nsegmentation. 
Comprehensive experiments on various datasets illustrate the\nsuperiority of NCLR over existing self-supervised methods. The results confirm\nthat joint learning from different modalities significantly enhances the\nnetwork's understanding abilities and effectiveness of learned representation.\nCode will be available at \\url{https://github.com/Eaphan/NCLR}.\n","authors":["Yifan Zhang","Siyu Ren","Junhui Hou","Jinjian Wu","Guangming Shi"],"pdf_url":"https://arxiv.org/pdf/2401.12452v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2401.12451v1","updated":"2024-01-23T02:30:16Z","published":"2024-01-23T02:30:16Z","title":"Methods and strategies for improving the novel view synthesis quality of\n neural radiation field","summary":" Neural Radiation Field (NeRF) technology can learn a 3D implicit model of a\nscene from 2D images and synthesize realistic novel view images. This\ntechnology has received widespread attention from the industry and has good\napplication prospects. In response to the problem that the rendering quality of\nNeRF images needs to be improved, many researchers have proposed various\nmethods to improve the rendering quality in the past three years. The latest\nrelevant papers are classified and reviewed, the technical principles behind\nquality improvement are analyzed, and the future evolution direction of quality\nimprovement methods is discussed. This study can help researchers quickly\nunderstand the current state and evolutionary context of technology in this\nfield, which is helpful in inspiring the development of more efficient\nalgorithms and promoting the application of NeRF technology in related fields.\n","authors":["Shun Fang","Ming Cui","Xing Feng","Yanna Lv"],"pdf_url":"https://arxiv.org/pdf/2401.12451v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.02317v3","updated":"2024-01-23T02:29:35Z","published":"2023-05-03T17:58:29Z","title":"Visual Chain of Thought: Bridging Logical Gaps with Multimodal\n Infillings","summary":" Recent advances in large language models elicit reasoning in a\nchain-of-thought that allows models to decompose problems in a human-like\nfashion. Though this paradigm improves multi-step reasoning ability in language\nmodels, it is limited by being unimodal and applied mainly to\nquestion-answering tasks. We claim that incorporating visual augmentation into\nreasoning is essential, especially for complex, imaginative tasks.\nConsequently, we introduce VCoT, a novel method that leverages chain-of-thought\nprompting with vision-language grounding to recursively bridge the logical gaps\nwithin sequential data. Our method uses visual guidance to generate synthetic\nmultimodal infillings that add consistent and novel information to reduce the\nlogical gaps for downstream tasks that can benefit from temporal reasoning, as\nwell as provide interpretability into models' multi-step reasoning. 
We apply\nVCoT to the Visual Storytelling and WikiHow summarization datasets and\ndemonstrate through human evaluation that VCoT offers novel and consistent\nsynthetic data augmentation beating chain-of-thought baselines, which can be\nused to enhance downstream performance.\n","authors":["Daniel Rose","Vaishnavi Himakunthala","Andy Ouyang","Ryan He","Alex Mei","Yujie Lu","Michael Saxon","Chinmay Sonar","Diba Mirza","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2305.02317v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12447v1","updated":"2024-01-23T02:25:23Z","published":"2024-01-23T02:25:23Z","title":"NIV-SSD: Neighbor IoU-Voting Single-Stage Object Detector From Point\n Cloud","summary":" Previous single-stage detectors typically suffer the misalignment between\nlocalization accuracy and classification confidence. To solve the misalignment\nproblem, we introduce a novel rectification method named neighbor IoU-voting\n(NIV) strategy. Typically, classification and regression are treated as\nseparate branches, making it challenging to establish a connection between\nthem. Consequently, the classification confidence cannot accurately reflect the\nregression quality. NIV strategy can serve as a bridge between classification\nand regression branches by calculating two types of statistical data from the\nregression output to correct the classification confidence. Furthermore, to\nalleviate the imbalance of detection accuracy for complete objects with dense\npoints (easy objects) and incomplete objects with sparse points (difficult\nobjects), we propose a new data augmentation scheme named object resampling. It\nundersamples easy objects and oversamples difficult objects by randomly\ntransforming part of easy objects into difficult objects. Finally, combining\nthe NIV strategy and object resampling augmentation, we design an efficient\nsingle-stage detector termed NIV-SSD. Extensive experiments on several datasets\nindicate the effectiveness of the NIV strategy and the competitive performance\nof the NIV-SSD detector. The code will be available at\nhttps://github.com/Say2L/NIV-SSD.\n","authors":["Shuai Liu","Di Wang","Quan Wang","Kai Huang"],"pdf_url":"https://arxiv.org/pdf/2401.12447v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12439v1","updated":"2024-01-23T02:18:53Z","published":"2024-01-23T02:18:53Z","title":"MAST: Video Polyp Segmentation with a Mixture-Attention Siamese\n Transformer","summary":" Accurate segmentation of polyps from colonoscopy videos is of great\nsignificance to polyp treatment and early prevention of colorectal cancer.\nHowever, it is challenging due to the difficulties associated with modelling\nlong-range spatio-temporal relationships within a colonoscopy video. In this\npaper, we address this challenging task with a novel Mixture-Attention Siamese\nTransformer (MAST), which explicitly models the long-range spatio-temporal\nrelationships with a mixture-attention mechanism for accurate polyp\nsegmentation. Specifically, we first construct a Siamese transformer\narchitecture to jointly encode paired video frames for their feature\nrepresentations. We then design a mixture-attention module to exploit the\nintra-frame and inter-frame correlations, enhancing the features with rich\nspatio-temporal relationships. Finally, the enhanced features are fed to two\nparallel decoders for predicting the segmentation maps. To the best of our\nknowledge, our MAST is the first transformer model dedicated to video polyp\nsegmentation. 
Extensive experiments on the large-scale SUN-SEG benchmark\ndemonstrate the superior performance of MAST in comparison with the\ncutting-edge competitors. Our code is publicly available at\nhttps://github.com/Junqing-Yang/MAST.\n","authors":["Geng Chen","Junqing Yang","Xiaozhou Pu","Ge-Peng Ji","Huan Xiong","Yongsheng Pan","Hengfei Cui","Yong Xia"],"pdf_url":"https://arxiv.org/pdf/2401.12439v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12438v1","updated":"2024-01-23T02:14:05Z","published":"2024-01-23T02:14:05Z","title":"Secure Federated Learning Approaches to Diagnosing COVID-19","summary":" The recent pandemic has underscored the importance of accurately diagnosing\nCOVID-19 in hospital settings. A major challenge in this regard is\ndifferentiating COVID-19 from other respiratory illnesses based on chest\nX-rays, compounded by the restrictions of HIPAA compliance which limit the\ncomparison of patient X-rays. This paper introduces a HIPAA-compliant model to\naid in the diagnosis of COVID-19, utilizing federated learning. Federated\nlearning is a distributed machine learning approach that allows for algorithm\ntraining across multiple decentralized devices using local data samples,\nwithout the need for data sharing. Our model advances previous efforts in chest\nX-ray diagnostic models. We examined leading models from established\ncompetitions in this domain and developed our own models tailored to be\neffective with specific hospital data. Considering the model's operation in a\nfederated learning context, we explored the potential impact of biased data\nupdates on the model's performance. To enhance hospital understanding of the\nmodel's decision-making process and to verify that the model is not focusing on\nirrelevant features, we employed a visualization technique that highlights key\nfeatures in chest X-rays indicative of a positive COVID-19 diagnosis.\n","authors":["Rittika Adhikari","Christopher Settles"],"pdf_url":"https://arxiv.org/pdf/2401.12438v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11687v2","updated":"2024-01-23T02:08:09Z","published":"2024-01-22T04:54:42Z","title":"TIM: An Efficient Temporal Interaction Module for Spiking Transformer","summary":" Spiking Neural Networks (SNNs), as the third generation of neural networks,\nhave gained prominence for their biological plausibility and computational\nefficiency, especially in processing diverse datasets. The integration of\nattention mechanisms, inspired by advancements in neural network architectures,\nhas led to the development of Spiking Transformers. These have shown promise in\nenhancing SNNs' capabilities, particularly in the realms of both static and\nneuromorphic datasets. Despite their progress, a discernible gap exists in\nthese systems, specifically in the Spiking Self Attention (SSA) mechanism's\neffectiveness in leveraging the temporal processing potential of SNNs. To\naddress this, we introduce the Temporal Interaction Module (TIM), a novel,\nconvolution-based enhancement designed to augment the temporal data processing\nabilities within SNN architectures. 
TIM's integration into existing SNN\nframeworks is seamless and efficient, requiring minimal additional parameters\nwhile significantly boosting their temporal information handling capabilities.\nThrough rigorous experimentation, TIM has demonstrated its effectiveness in\nexploiting temporal information, leading to state-of-the-art performance across\nvarious neuromorphic datasets.\n","authors":["Sicheng Shen","Dongcheng Zhao","Guobin Shen","Yi Zeng"],"pdf_url":"https://arxiv.org/pdf/2401.11687v2.pdf","comment":"9pages,6figures"},{"id":"http://arxiv.org/abs/2401.12433v1","updated":"2024-01-23T01:52:49Z","published":"2024-01-23T01:52:49Z","title":"A Novel Garment Transfer Method Supervised by Distilled Knowledge of\n Virtual Try-on Model","summary":" When a shopper chooses garments online, garment transfer technology wears the\ngarment from the model image onto the shopper's image, allowing the shopper to\ndecide whether the garment is suitable for them. As garment transfer leverages\nwild and cheap person image as garment condition, it has attracted tremendous\ncommunity attention and holds vast commercial potential. However, since the\nground truth of garment transfer is almost unavailable in reality, previous\nstudies have treated garment transfer as either pose transfer or garment-pose\ndisentanglement, and trained garment transfer in self-supervised learning, yet\ndo not cover garment transfer intentions completely. Therefore, the training\nsupervising the garment transfer is a rock-hard issue. Notably, virtual try-on\ntechnology has exhibited superior performance using self-supervised learning.\nWe supervise the garment transfer training via knowledge distillation from\nvirtual try-on. Specifically, we first train the transfer parsing reasoning\nmodel at multi-phases to provide shape guidance for downstream tasks. The\ntransfer parsing reasoning model learns the response and feature knowledge from\nthe try-on parsing reasoning model and absorbs the hard knowledge from the\nground truth. By leveraging the warping knowledge from virtual try-on, we\nestimate a progressive flow to precisely warp the garment by learning the shape\nand content correspondence. To enhance transfer realism, we propose a\nwell-designed arm regrowth task to infer exposed skin pixel content.\nExperiments demonstrate that our method has state-of-the-art performance in\ntransferring garments between person compared with other virtual try-on and\ngarment transfer methods.\n","authors":["Naiyu Fang","Lemiao Qiu","Shuyou Zhang","Zili Wang","Kerui Hu","Jianrong Tan"],"pdf_url":"https://arxiv.org/pdf/2401.12433v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.10610v3","updated":"2024-01-23T01:48:20Z","published":"2023-08-21T10:20:46Z","title":"Ultrafast and Ultralight Network-Based Intelligent System for Real-time\n Diagnosis of Ear Diseases in Any Devices","summary":" Traditional ear disease diagnosis heavily depends on experienced specialists\nand specialized equipment, frequently resulting in misdiagnoses, treatment\ndelays, and financial burdens for some patients. Utilizing deep learning models\nfor efficient ear disease diagnosis has proven effective and affordable.\nHowever, existing research overlooked model inference speed and parameter size\nrequired for deployment. To tackle these challenges, we constructed a\nlarge-scale dataset comprising eight ear disease categories and normal ear\ncanal samples from two hospitals. 
Inspired by ShuffleNetV2, we developed\nBest-EarNet, an ultrafast and ultralight network enabling real-time ear disease\ndiagnosis. Best-EarNet incorporates the novel Local-Global Spatial Feature\nFusion Module which can capture global and local spatial information\nsimultaneously and guide the network to focus on crucial regions within feature\nmaps at various levels, mitigating low accuracy issues. Moreover, our network\nuses multiple auxiliary classification heads for efficient parameter\noptimization. With 0.77M parameters, Best-EarNet achieves an average frames per\nsecond of 80 on CPU. Employing transfer learning and five-fold cross-validation\nwith 22,581 images from Hospital-1, the model achieves an impressive 95.23%\naccuracy. External testing on 1,652 images from Hospital-2 validates its\nperformance, yielding 92.14% accuracy. Compared to state-of-the-art networks,\nBest-EarNet establishes a new state-of-the-art (SOTA) in practical\napplications. Most importantly, we developed an intelligent diagnosis system\ncalled Ear Keeper, which can be deployed on common electronic devices. By\nmanipulating a compact electronic otoscope, users can perform comprehensive\nscanning and diagnosis of the ear canal using real-time video. This study\nprovides a novel paradigm for ear endoscopy and other medical endoscopic image\nrecognition applications.\n","authors":["Yubiao Yue","Xinyu Zeng","Xiaoqiang Shi","Meiping Zhang","Haihua Liang","Fan Zhang","Yanmei Chen","Zefeng Xie","Wenrui Wu","Zhenzhang Li"],"pdf_url":"https://arxiv.org/pdf/2308.10610v3.pdf","comment":"18 pages,8 figures"},{"id":"http://arxiv.org/abs/2209.09930v2","updated":"2024-01-23T01:36:36Z","published":"2022-09-20T18:08:34Z","title":"Deep Superpixel Generation and Clustering for Weakly Supervised\n Segmentation of Brain Tumors in MR Images","summary":" Training machine learning models to segment tumors and other anomalies in\nmedical images is an important step for developing diagnostic tools but\ngenerally requires manually annotated ground truth segmentations, which\nnecessitates significant time and resources. This work proposes the use of a\nsuperpixel generation model and a superpixel clustering model to enable weakly\nsupervised brain tumor segmentations. The proposed method utilizes binary\nimage-level classification labels, which are readily accessible, to\nsignificantly improve the initial region of interest segmentations generated by\nstandard weakly supervised methods without requiring ground truth annotations.\nWe used 2D slices of magnetic resonance brain scans from the Multimodal Brain\nTumor Segmentation Challenge 2020 dataset and labels indicating the presence of\ntumors to train the pipeline. On the test cohort, our method achieved a mean\nDice coefficient of 0.691 and a mean 95% Hausdorff distance of 18.1,\noutperforming existing superpixel-based weakly supervised segmentation methods.\n","authors":["Jay J. Yoo","Khashayar Namdar","Farzad Khalvati"],"pdf_url":"https://arxiv.org/pdf/2209.09930v2.pdf","comment":"12 pages, LaTeX; updated methodology, added additional results,\n revised discussion"},{"id":"http://arxiv.org/abs/2401.12425v1","updated":"2024-01-23T01:25:00Z","published":"2024-01-23T01:25:00Z","title":"The Neglected Tails of Vision-Language Models","summary":" Vision-language models (VLMs) excel in zero-shot recognition but exhibit\ndrastically imbalanced performance across visual concepts. 
For example, CLIP,\ndespite an impressive mean zero-shot accuracy on ImageNet (72.7%), yields\n$<$10% on ten concepts (e.g., gyromitra and night snake), presumably, because\nthese concepts are under-represented in VLMs' imbalanced pretraining data. Yet,\nassessing this imbalance is challenging as it is non-trivial to calculate the\nfrequency of specific concepts within VLMs' large-scale pretraining data. Our\nwork makes the first attempt to measure the concept frequency by analyzing\npretraining texts. We use off-the-shelf language models to help count relevant\ntexts that contain synonyms of the given concepts and resolve linguistic\nambiguity. We confirm that popular VLM datasets like LAION indeed exhibit\nlong-tailed concept distributions, which strongly correlate with per-class\naccuracies. Further, contemporary multimodal systems, e.g., visual chatbots and\ntext-to-image generators, also struggle with the rare concepts identified by\nour method. To mitigate VLMs' imbalanced performance in zero-shot recognition,\nwe propose REtrieval-Augmented Learning REAL. First, instead of prompting VLMs\nusing the original class names, REAL uses their most frequent synonyms found in\nVLMs' pretraining texts. This already outperforms human-engineered and\nLLM-generated prompts over nine benchmark datasets, likely because VLMs have\nseen more images associated with the frequently used synonyms. Second, REAL\nuses all the concept synonyms to retrieve a small, class-balanced set of\npretraining data to train a robust classifier. REAL surpasses the recent\nretrieval-augmented solution REACT, using 400x less storage and 10,000x less\ntraining time!\n","authors":["Shubham Parashar","Zhiqiu Lin","Tian Liu","Xiangjue Dong","Yanan Li","Deva Ramanan","James Caverlee","Shu Kong"],"pdf_url":"https://arxiv.org/pdf/2401.12425v1.pdf","comment":"Project Page:\n https://shubhamprshr27.github.io/neglected-tails-of-vlms/"},{"id":"http://arxiv.org/abs/2401.12422v1","updated":"2024-01-23T01:11:10Z","published":"2024-01-23T01:11:10Z","title":"InverseMatrixVT3D: An Efficient Projection Matrix-Based Approach for 3D\n Occupancy Prediction","summary":" This paper introduces InverseMatrixVT3D, an efficient method for transforming\nmulti-view image features into 3D feature volumes for 3D semantic occupancy\nprediction. Existing methods for constructing 3D volumes often rely on depth\nestimation, device-specific operators, or transformer queries, which hinders\nthe widespread adoption of 3D occupancy models. In contrast, our approach\nleverages two projection matrices to store the static mapping relationships and\nmatrix multiplications to efficiently generate global Bird's Eye View (BEV)\nfeatures and local 3D feature volumes. Specifically, we achieve this by\nperforming matrix multiplications between multi-view image feature maps and two\nsparse projection matrices. We introduce a sparse matrix handling technique for\nthe projection matrices to optimise GPU memory usage. Moreover, a global-local\nattention fusion module is proposed to integrate the global BEV features with\nthe local 3D feature volumes to obtain the final 3D volume. We also employ a\nmulti-scale supervision mechanism to further enhance performance. Comprehensive\nexperiments on the nuScenes dataset demonstrate the simplicity and\neffectiveness of our method. 
The code will be made available\nat:https://github.com/DanielMing123/InverseMatrixVT3D\n","authors":["Zhenxing Ming","Julie Stephany Berrio","Mao Shan","Stewart Worrall"],"pdf_url":"https://arxiv.org/pdf/2401.12422v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12421v1","updated":"2024-01-23T01:10:25Z","published":"2024-01-23T01:10:25Z","title":"AdaEmbed: Semi-supervised Domain Adaptation in the Embedding Space","summary":" Semi-supervised domain adaptation (SSDA) presents a critical hurdle in\ncomputer vision, especially given the frequent scarcity of labeled data in\nreal-world settings. This scarcity often causes foundation models, trained on\nextensive datasets, to underperform when applied to new domains. AdaEmbed, our\nnewly proposed methodology for SSDA, offers a promising solution to these\nchallenges. Leveraging the potential of unlabeled data, AdaEmbed facilitates\nthe transfer of knowledge from a labeled source domain to an unlabeled target\ndomain by learning a shared embedding space. By generating accurate and uniform\npseudo-labels based on the established embedding space, the model overcomes the\nlimitations of conventional SSDA, thus enhancing performance significantly. Our\nmethod's effectiveness is validated through extensive experiments on benchmark\ndatasets such as DomainNet, Office-Home, and VisDA-C, where AdaEmbed\nconsistently outperforms all the baselines, setting a new state of the art for\nSSDA. With its straightforward implementation and high data efficiency,\nAdaEmbed stands out as a robust and pragmatic solution for real-world\nscenarios, where labeled data is scarce. To foster further research and\napplication in this area, we are sharing the codebase of our unified framework\nfor semi-supervised domain adaptation.\n","authors":["Ali Mottaghi","Mohammad Abdullah Jamal","Serena Yeung","Omid Mohareri"],"pdf_url":"https://arxiv.org/pdf/2401.12421v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.11321v2","updated":"2024-01-23T01:09:46Z","published":"2023-05-18T22:09:32Z","title":"JoIN: Joint GANs Inversion for Intrinsic Image Decomposition","summary":" In this work, we propose to solve ill-posed inverse imaging problems using a\nbank of Generative Adversarial Networks (GAN) as a prior and apply our method\nto the case of Intrinsic Image Decomposition for faces and materials. Our\nmethod builds on the demonstrated success of GANs to capture complex image\ndistributions. At the core of our approach is the idea that the latent space of\na GAN is a well-suited optimization domain to solve inverse problems. Given an\ninput image, we propose to jointly inverse the latent codes of a set of GANs\nand combine their outputs to reproduce the input. Contrary to most GAN\ninversion methods which are limited to inverting only a single GAN, we\ndemonstrate that it is possible to maintain distribution priors while inverting\nseveral GANs jointly. 
We show that our approach is modular, allowing various\nforward imaging models, and that it can successfully decompose both synthetic\nand real images.\n","authors":["Viraj Shah","Svetlana Lazebnik","Julien Philip"],"pdf_url":"https://arxiv.org/pdf/2305.11321v2.pdf","comment":"Project webpage is available at https://virajshah.com/join"},{"id":"http://arxiv.org/abs/2401.12419v1","updated":"2024-01-23T00:42:04Z","published":"2024-01-23T00:42:04Z","title":"Multi-modal News Understanding with Professionally Labelled Videos\n (ReutersViLNews)","summary":" While progress has been made in the domain of video-language understanding,\ncurrent state-of-the-art algorithms are still limited in their ability to\nunderstand videos at high levels of abstraction, such as news-oriented videos.\nAlternatively, humans easily amalgamate information from video and language to\ninfer information beyond what is visually observable in the pixels. An example\nof this is watching a news story, where the context of the event can play as\nbig of a role in understanding the story as the event itself. Towards a\nsolution for designing this ability in algorithms, we present a large-scale\nanalysis on an in-house dataset collected by the Reuters News Agency, called\nReuters Video-Language News (ReutersViLNews) dataset which focuses on\nhigh-level video-language understanding with an emphasis on long-form news. The\nReutersViLNews Dataset consists of long-form news videos collected and labeled\nby news industry professionals over several years and contains prominent news\nreporting from around the world. Each video involves a single story and\ncontains action shots of the actual event, interviews with people associated\nwith the event, footage from nearby areas, and more. ReutersViLNews dataset\ncontains videos from seven subject categories: disaster, finance,\nentertainment, health, politics, sports, and miscellaneous with annotations\nfrom high-level to low-level, title caption, visual video description,\nhigh-level story description, keywords, and location. We first present an\nanalysis of the dataset statistics of ReutersViLNews compared to previous\ndatasets. Then we benchmark state-of-the-art approaches for four different\nvideo-language tasks. The results suggest that news-oriented videos are a\nsubstantial challenge for current video-language understanding algorithms and\nwe conclude by providing future directions in designing approaches to solve the\nReutersViLNews dataset.\n","authors":["Shih-Han Chou","Matthew Kowal","Yasmin Niknam","Diana Moyano","Shayaan Mehdi","Richard Pito","Cheng Zhang","Ian Knopke","Sedef Akinli Kocak","Leonid Sigal","Yalda Mohsenzadeh"],"pdf_url":"https://arxiv.org/pdf/2401.12419v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12414v1","updated":"2024-01-23T00:06:19Z","published":"2024-01-23T00:06:19Z","title":"Icy Moon Surface Simulation and Stereo Depth Estimation for Sampling\n Autonomy","summary":" Sampling autonomy for icy moon lander missions requires understanding of\ntopographic and photometric properties of the sampling terrain. Unavailability\nof high resolution visual datasets (either bird-eye view or point-of-view from\na lander) is an obstacle for selection, verification or development of\nperception systems. 
We attempt to alleviate this problem by: 1) proposing\nGraphical Utility for Icy moon Surface Simulations (GUISS) framework, for\nversatile stereo dataset generation that spans the spectrum of bulk photometric\nproperties, and 2) focusing on a stereo-based visual perception system and\nevaluating both traditional and deep learning-based algorithms for depth\nestimation from stereo matching. The surface reflectance properties of icy moon\nterrains (Enceladus and Europa) are inferred from multispectral datasets of\nprevious missions. With procedural terrain generation and physically valid\nillumination sources, our framework can fit a wide range of hypotheses with\nrespect to visual representations of icy moon terrains. This is followed by a\nstudy over the performance of stereo matching algorithms under different visual\nhypotheses. Finally, we emphasize the standing challenges to be addressed for\nsimulating perception data assets for icy moons such as Enceladus and Europa.\nOur code can be found here: https://github.com/nasa-jpl/guiss.\n","authors":["Ramchander Bhaskara","Georgios Georgakis","Jeremy Nash","Marissa Cameron","Joseph Bowkett","Adnan Ansar","Manoranjan Majji","Paul Backes"],"pdf_url":"https://arxiv.org/pdf/2401.12414v1.pdf","comment":"Software: https://github.com/nasa-jpl/guiss. IEEE Aerospace\n Conference 2024"},{"id":"http://arxiv.org/abs/2401.13147v1","updated":"2024-01-23T23:50:04Z","published":"2024-01-23T23:50:04Z","title":"Deep Spatiotemporal Clutter Filtering of Transthoracic Echocardiographic\n Images Using a 3D Convolutional Auto-Encoder","summary":" This study presents a deep convolutional auto-encoder network for filtering\nreverberation artifacts, from transthoracic echocardiographic (TTE) image\nsequences. Given the spatiotemporal nature of these artifacts, the filtering\nnetwork was built using 3D convolutional layers to suppress the clutter\npatterns throughout the cardiac cycle. The network was designed by taking\nadvantage of: i) an attention mechanism to focus primarily on cluttered regions\nand ii) residual learning to preserve fine structures of the image frames. To\ntrain the deep network, a diverse set of artifact patterns was simulated and\nthe simulated patterns were superimposed onto artifact-free ultra-realistic\nsynthetic TTE sequences of six ultrasound vendors to generate input of the\nfiltering network. The artifact-free sequences served as ground-truth.\nPerformance of the filtering network was evaluated using unseen synthetic as\nwell as in-vivo artifactual sequences. Satisfactory results obtained using the\nlatter dataset confirmed the good generalization performance of the proposed\nnetwork which was trained using the synthetic sequences and simulated artifact\npatterns. Suitability of the clutter-filtered sequences for further processing\nwas assessed by computing segmental strain curves from them. The results showed\nthat the large discrepancy between the strain profiles computed from the\ncluttered segments and their corresponding segments in the clutter-free images\nwas significantly reduced after filtering the sequences using the proposed\nnetwork. The trained deep network could process an artifactual TTE sequence in\na fraction of a second and can be used for real-time clutter filtering.\nMoreover, it can improve the precision of the clinical indexes that are\ncomputed from the TTE sequences. 
The source code of the proposed method is\navailable at:\nhttps://github.com/MahdiTabassian/Deep-Clutter-Filtering/tree/main.\n","authors":["Mahdi Tabassian","Somayeh Akbari. S","Sandro Queirós","Jan D'hooge"],"pdf_url":"https://arxiv.org/pdf/2401.13147v1.pdf","comment":"18 pages, 14 figures"},{"id":"http://arxiv.org/abs/2401.00496v2","updated":"2024-01-23T23:30:57Z","published":"2023-12-31T13:32:18Z","title":"SAR-RARP50: Segmentation of surgical instrumentation and Action\n Recognition on Robot-Assisted Radical Prostatectomy Challenge","summary":" Surgical tool segmentation and action recognition are fundamental building\nblocks in many computer-assisted intervention applications, ranging from\nsurgical skills assessment to decision support systems. Nowadays,\nlearning-based action recognition and segmentation approaches outperform\nclassical methods, relying, however, on large, annotated datasets. Furthermore,\naction recognition and tool segmentation algorithms are often trained and make\npredictions in isolation from each other, without exploiting potential\ncross-task relationships. With the EndoVis 2022 SAR-RARP50 challenge, we\nrelease the first multimodal, publicly available, in-vivo, dataset for surgical\naction recognition and semantic instrumentation segmentation, containing 50\nsuturing video segments of Robotic Assisted Radical Prostatectomy (RARP). The\naim of the challenge is twofold. First, to enable researchers to leverage the\nscale of the provided dataset and develop robust and highly accurate\nsingle-task action recognition and tool segmentation approaches in the surgical\ndomain. Second, to further explore the potential of multitask-based learning\napproaches and determine their comparative advantage against their single-task\ncounterparts. A total of 12 teams participated in the challenge, contributing 7\naction recognition methods, 9 instrument segmentation techniques, and 4\nmultitask approaches that integrated both action recognition and instrument\nsegmentation. The complete SAR-RARP50 dataset is available at:\nhttps://rdr.ucl.ac.uk/projects/SARRARP50_Segmentation_of_surgical_instrumentation_and_Action_Recognition_on_Robot-Assisted_Radical_Prostatectomy_Challenge/191091\n","authors":["Dimitrios Psychogyios","Emanuele Colleoni","Beatrice Van Amsterdam","Chih-Yang Li","Shu-Yu Huang","Yuchong Li","Fucang Jia","Baosheng Zou","Guotai Wang","Yang Liu","Maxence Boels","Jiayu Huo","Rachel Sparks","Prokar Dasgupta","Alejandro Granados","Sebastien Ourselin","Mengya Xu","An Wang","Yanan Wu","Long Bai","Hongliang Ren","Atsushi Yamada","Yuriko Harai","Yuto Ishikawa","Kazuyuki Hayashi","Jente Simoens","Pieter DeBacker","Francesco Cisternino","Gabriele Furnari","Alex Mottrie","Federica Ferraguti","Satoshi Kondo","Satoshi Kasai","Kousuke Hirasawa","Soohee Kim","Seung Hyun Lee","Kyu Eun Lee","Hyoun-Joong Kong","Kui Fu","Chao Li","Shan An","Stefanie Krell","Sebastian Bodenstedt","Nicolas Ayobi","Alejandra Perez","Santiago Rodriguez","Juanita Puentes","Pablo Arbelaez","Omid Mohareri","Danail Stoyanov"],"pdf_url":"https://arxiv.org/pdf/2401.00496v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.13140v1","updated":"2024-01-23T23:28:15Z","published":"2024-01-23T23:28:15Z","title":"Dual-Domain Coarse-to-Fine Progressive Estimation Network for\n Simultaneous Denoising, Limited-View Reconstruction, and Attenuation\n Correction of Cardiac SPECT","summary":" Single-Photon Emission Computed Tomography (SPECT) is widely applied for the\ndiagnosis of coronary artery diseases. 
Low-dose (LD) SPECT aims to minimize\nradiation exposure but leads to increased image noise. Limited-view (LV) SPECT,\nsuch as the latest GE MyoSPECT ES system, enables accelerated scanning and\nreduces hardware expenses but degrades reconstruction accuracy. Additionally,\nComputed Tomography (CT) is commonly used to derive attenuation maps\n($\\mu$-maps) for attenuation correction (AC) of cardiac SPECT, but it will\nintroduce additional radiation exposure and SPECT-CT misalignments. Although\nvarious methods have been developed to solely focus on LD denoising, LV\nreconstruction, or CT-free AC in SPECT, the solution for simultaneously\naddressing these tasks remains challenging and under-explored. Furthermore, it\nis essential to explore the potential of fusing cross-domain and cross-modality\ninformation across these interrelated tasks to further enhance the accuracy of\neach task. Thus, we propose a Dual-Domain Coarse-to-Fine Progressive Network\n(DuDoCFNet), a multi-task learning method for simultaneous LD denoising, LV\nreconstruction, and CT-free $\\mu$-map generation of cardiac SPECT. Paired\ndual-domain networks in DuDoCFNet are cascaded using a multi-layer fusion\nmechanism for cross-domain and cross-modality feature fusion. Two-stage\nprogressive learning strategies are applied in both projection and image\ndomains to achieve coarse-to-fine estimations of SPECT projections and\nCT-derived $\\mu$-maps. Our experiments demonstrate DuDoCFNet's superior\naccuracy in estimating projections, generating $\\mu$-maps, and AC\nreconstructions compared to existing single- or multi-task learning methods,\nunder various iterations and LD levels. The source code of this work is\navailable at https://github.com/XiongchaoChen/DuDoCFNet-MultiTask.\n","authors":["Xiongchao Chen","Bo Zhou","Xueqi Guo","Huidong Xie","Qiong Liu","James S. Duncan","Albert J. Sinusas","Chi Liu"],"pdf_url":"https://arxiv.org/pdf/2401.13140v1.pdf","comment":"11 Pages, 10 figures, 4 tables"},{"id":"http://arxiv.org/abs/2211.04625v2","updated":"2024-01-23T21:24:53Z","published":"2022-11-09T01:04:06Z","title":"Soft Augmentation for Image Classification","summary":" Modern neural networks are over-parameterized and thus rely on strong\nregularization such as data augmentation and weight decay to reduce overfitting\nand improve generalization. The dominant form of data augmentation applies\ninvariant transforms, where the learning target of a sample is invariant to the\ntransform applied to that sample. We draw inspiration from human visual\nclassification studies and propose generalizing augmentation with invariant\ntransforms to soft augmentation where the learning target softens non-linearly\nas a function of the degree of the transform applied to the sample: e.g., more\naggressive image crop augmentations produce less confident learning targets. We\ndemonstrate that soft targets allow for more aggressive data augmentation,\noffer more robust performance boosts, work with other augmentation policies,\nand interestingly, produce better calibrated models (since they are trained to\nbe less confident on aggressively cropped/occluded examples). Combined with\nexisting aggressive augmentation strategies, soft target 1) doubles the top-1\naccuracy boost across Cifar-10, Cifar-100, ImageNet-1K, and ImageNet-V2, 2)\nimproves model occlusion performance by up to $4\\times$, and 3) halves the\nexpected calibration error (ECE). Finally, we show that soft augmentation\ngeneralizes to self-supervised classification tasks. 
Code available at\nhttps://github.com/youngleox/soft_augmentation\n","authors":["Yang Liu","Shen Yan","Laura Leal-Taixé","James Hays","Deva Ramanan"],"pdf_url":"https://arxiv.org/pdf/2211.04625v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.13097v1","updated":"2024-01-23T21:22:06Z","published":"2024-01-23T21:22:06Z","title":"Digital Divides in Scene Recognition: Uncovering Socioeconomic Biases in\n Deep Learning Systems","summary":" Computer-based scene understanding has influenced fields ranging from urban\nplanning to autonomous vehicle performance, yet little is known about how well\nthese technologies work across social differences. We investigate the biases of\ndeep convolutional neural networks (dCNNs) in scene classification, using\nnearly one million images from global and US sources, including user-submitted\nhome photographs and Airbnb listings. We applied statistical models to quantify\nthe impact of socioeconomic indicators such as family income, Human Development\nIndex (HDI), and demographic factors from public data sources (CIA and US\nCensus) on dCNN performance. Our analyses revealed significant socioeconomic\nbias, where pretrained dCNNs demonstrated lower classification accuracy, lower\nclassification confidence, and a higher tendency to assign labels that could be\noffensive when applied to homes (e.g., \"ruin\", \"slum\"), especially in images\nfrom homes with lower socioeconomic status (SES). This trend is consistent\nacross two datasets of international images and within the diverse economic and\nracial landscapes of the United States. This research contributes to\nunderstanding biases in computer vision, emphasizing the need for more\ninclusive and representative training datasets. By mitigating the bias in the\ncomputer vision pipelines, we can ensure fairer and more equitable outcomes for\napplied computer vision, including home valuation and smart home security\nsystems. There is urgency in addressing these biases, which can significantly\nimpact critical decisions in urban development and resource allocation. Our\nfindings also motivate the development of AI systems that better understand and\nserve diverse communities, moving towards technology that equitably benefits\nall sectors of society.\n","authors":["Michelle R. Greene","Mariam Josyula","Wentao Si","Jennifer A. Hart"],"pdf_url":"https://arxiv.org/pdf/2401.13097v1.pdf","comment":"20 pages, 3 figures, 3 tables"},{"id":"http://arxiv.org/abs/2401.13087v1","updated":"2024-01-23T20:56:16Z","published":"2024-01-23T20:56:16Z","title":"Open-source data pipeline for street-view images: a case study on\n community mobility during COVID-19 pandemic","summary":" Street View Images (SVI) are a common source of valuable data for\nresearchers. Researchers have used SVI data for estimating pedestrian volumes,\ndemographic surveillance, and to better understand built and natural\nenvironments in cityscapes. However, the most common source of publicly\navailable SVI data is Google Street View. Google Street View images are\ncollected infrequently, making temporal analysis challenging, especially in low\npopulation density areas. Our main contribution is the development of an\nopen-source data pipeline for processing 360-degree video recorded from a\ncar-mounted camera. The video data is used to generate SVIs, which then can be\nused as an input for temporal analysis. 
We demonstrate the use of the pipeline\nby collecting a SVI dataset over a 38-month longitudinal survey of Seattle, WA,\nUSA during the COVID-19 pandemic. The output of our pipeline is validated\nthrough statistical analyses of pedestrian traffic in the images. We confirm\nknown results in the literature and provide new insights into outdoor\npedestrian traffic patterns. This study demonstrates the feasibility and value\nof collecting and using SVI for research purposes beyond what is possible with\ncurrently available SVI data. Limitations and future improvements on the data\npipeline and case study are also discussed.\n","authors":["Matthew Martell","Nick Terry","Ribhu Sengupta","Chris Salazar","Nicole A. Errett","Scott B. Miles","Joseph Wartman","Youngjun Choe"],"pdf_url":"https://arxiv.org/pdf/2401.13087v1.pdf","comment":"16 pages, 4 figures, two tables. Martell and Terry are equally\n contributing first authors"},{"id":"http://arxiv.org/abs/2306.08877v3","updated":"2024-01-23T20:55:48Z","published":"2023-06-15T06:21:44Z","title":"Linguistic Binding in Diffusion Models: Enhancing Attribute\n Correspondence through Attention Map Alignment","summary":" Text-conditioned image generation models often generate incorrect\nassociations between entities and their visual attributes. This reflects an\nimpaired mapping between linguistic binding of entities and modifiers in the\nprompt and visual binding of the corresponding elements in the generated image.\nAs one notable example, a query like \"a pink sunflower and a yellow flamingo\"\nmay incorrectly produce an image of a yellow sunflower and a pink flamingo. To\nremedy this issue, we propose SynGen, an approach which first syntactically\nanalyses the prompt to identify entities and their modifiers, and then uses a\nnovel loss function that encourages the cross-attention maps to agree with the\nlinguistic binding reflected by the syntax. Specifically, we encourage large\noverlap between attention maps of entities and their modifiers, and small\noverlap with other entities and modifier words. The loss is optimized during\ninference, without retraining or fine-tuning the model. Human evaluation on\nthree datasets, including one new and challenging set, demonstrate significant\nimprovements of SynGen compared with current state of the art methods. This\nwork highlights how making use of sentence structure during inference can\nefficiently and substantially improve the faithfulness of text-to-image\ngeneration.\n","authors":["Royi Rassin","Eran Hirsch","Daniel Glickman","Shauli Ravfogel","Yoav Goldberg","Gal Chechik"],"pdf_url":"https://arxiv.org/pdf/2306.08877v3.pdf","comment":"Accepted to NeurIPS 2023 (oral). Our code is publicly available at\n https://github.com/RoyiRa/Syntax-Guided-Generation"},{"id":"http://arxiv.org/abs/2309.07254v4","updated":"2024-01-23T20:43:50Z","published":"2023-09-13T18:43:13Z","title":"Mitigate Replication and Copying in Diffusion Models with Generalized\n Caption and Dual Fusion Enhancement","summary":" While diffusion models demonstrate a remarkable capability for generating\nhigh-quality images, their tendency to `replicate' training data raises privacy\nconcerns. Although recent research suggests that this replication may stem from\nthe insufficient generalization of training data captions and duplication of\ntraining images, effective mitigation strategies remain elusive. 
To address\nthis gap, our paper first introduces a generality score that measures the\ncaption generality and employ large language model (LLM) to generalize training\ncaptions. Subsequently, we leverage generalized captions and propose a novel\ndual fusion enhancement approach to mitigate the replication of diffusion\nmodels. Our empirical results demonstrate that our proposed methods can\nsignificantly reduce replication by 43.5% compared to the original diffusion\nmodel while maintaining the diversity and quality of generations. Code is\navailable at https://github.com/HowardLi0816/dual-fusion-diffusion.\n","authors":["Chenghao Li","Dake Chen","Yuke Zhang","Peter A. Beerel"],"pdf_url":"https://arxiv.org/pdf/2309.07254v4.pdf","comment":"This paper has been accepted for presentation at 2024 IEEE\n International Conference on Acoustics, Speech, and Signal Processing (ICASSP\n 2024)"},{"id":"http://arxiv.org/abs/2309.04447v3","updated":"2024-01-23T20:34:05Z","published":"2023-09-08T17:13:22Z","title":"Impact of Blur and Resolution on Demographic Disparities in 1-to-Many\n Facial Identification","summary":" Most studies to date that have examined demographic variations in face\nrecognition accuracy have analyzed 1-to-1 matching accuracy, using images that\ncould be described as \"government ID quality\". This paper analyzes the accuracy\nof 1-to-many facial identification across demographic groups, and in the\npresence of blur and reduced resolution in the probe image as might occur in\n\"surveillance camera quality\" images. Cumulative match characteristic curves\n(CMC) are not appropriate for comparing propensity for rank-one recognition\nerrors across demographics, and so we use three metrics for our analysis: (1)\nthe well-known d' metric between mated and non-mated score distributions, and\nintroduced in this work, (2) absolute score difference between thresholds in\nthe high-similarity tail of the non-mated and the low-similarity tail of the\nmated distribution, and (3) distribution of (mated - non-mated rank-one scores)\nacross the set of probe images. We find that demographic variation in 1-to-many\naccuracy does not entirely follow what has been observed in 1-to-1 matching\naccuracy. Also, different from 1-to-1 accuracy, demographic comparison of\n1-to-many accuracy can be affected by different numbers of identities and\nimages across demographics. More importantly, we show that increased blur in\nthe probe image, or reduced resolution of the face in the probe image, can\nsignificantly increase the false positive identification rate. And we show that\nthe demographic variation in these high blur or low resolution conditions is\nmuch larger for male / female than for African-American / Caucasian. The point\nthat 1-to-many accuracy can potentially collapse in the context of processing\n\"surveillance camera quality\" probe images against a \"government ID quality\"\ngallery is an important one.\n","authors":["Aman Bhatta","Gabriella Pangelinan","Michael C. King","Kevin W. 
Bowyer"],"pdf_url":"https://arxiv.org/pdf/2309.04447v3.pdf","comment":"9 pages, 8 figures, Conference submission"},{"id":"http://arxiv.org/abs/2401.13082v1","updated":"2024-01-23T20:28:06Z","published":"2024-01-23T20:28:06Z","title":"PlaceFormer: Transformer-based Visual Place Recognition using\n Multi-Scale Patch Selection and Fusion","summary":" Visual place recognition is a challenging task in the field of computer\nvision, and autonomous robotics and vehicles, which aims to identify a location\nor a place from visual inputs. Contemporary methods in visual place recognition\nemploy convolutional neural networks and utilize every region within the image\nfor the place recognition task. However, the presence of dynamic and\ndistracting elements in the image may impact the effectiveness of the place\nrecognition process. Therefore, it is meaningful to focus on task-relevant\nregions of the image for improved recognition. In this paper, we present\nPlaceFormer, a novel transformer-based approach for visual place recognition.\nPlaceFormer employs patch tokens from the transformer to create global image\ndescriptors, which are then used for image retrieval. To re-rank the retrieved\nimages, PlaceFormer merges the patch tokens from the transformer to form\nmulti-scale patches. Utilizing the transformer's self-attention mechanism, it\nselects patches that correspond to task-relevant areas in an image. These\nselected patches undergo geometric verification, generating similarity scores\nacross different patch sizes. Subsequently, spatial scores from each patch size\nare fused to produce a final similarity score. This score is then used to\nre-rank the images initially retrieved using global image descriptors.\nExtensive experiments on benchmark datasets demonstrate that PlaceFormer\noutperforms several state-of-the-art methods in terms of accuracy and\ncomputational efficiency, requiring less time and memory.\n","authors":["Shyam Sundar Kannan","Byung-Cheol Min"],"pdf_url":"https://arxiv.org/pdf/2401.13082v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.13081v1","updated":"2024-01-23T20:26:52Z","published":"2024-01-23T20:26:52Z","title":"Free Form Medical Visual Question Answering in Radiology","summary":" Visual Question Answering (VQA) in the medical domain presents a unique,\ninterdisciplinary challenge, combining fields such as Computer Vision, Natural\nLanguage Processing, and Knowledge Representation. Despite its importance,\nresearch in medical VQA has been scant, only gaining momentum since 2018.\nAddressing this gap, our research delves into the effective representation of\nradiology images and the joint learning of multimodal representations,\nsurpassing existing methods. We innovatively augment the SLAKE dataset,\nenabling our model to respond to a more diverse array of questions, not limited\nto the immediate content of radiology or pathology images. Our model achieves a\ntop-1 accuracy of 79.55\\% with a less complex architecture, demonstrating\ncomparable performance to current state-of-the-art models. 
This research not\nonly advances medical VQA but also opens avenues for practical applications in\ndiagnostic settings.\n","authors":["Abhishek Narayanan","Rushabh Musthyala","Rahul Sankar","Anirudh Prasad Nistala","Pranav Singh","Jacopo Cirrone"],"pdf_url":"https://arxiv.org/pdf/2401.13081v1.pdf","comment":"6 pages and 4 figures"},{"id":"http://arxiv.org/abs/2401.13076v1","updated":"2024-01-23T20:02:02Z","published":"2024-01-23T20:02:02Z","title":"SemanticSLAM: Learning based Semantic Map Construction and Robust Camera\n Localization","summary":" Current techniques in Visual Simultaneous Localization and Mapping (VSLAM)\nestimate camera displacement by comparing image features of consecutive scenes.\nThese algorithms depend on scene continuity, hence requires frequent camera\ninputs. However, processing images frequently can lead to significant memory\nusage and computation overhead. In this study, we introduce SemanticSLAM, an\nend-to-end visual-inertial odometry system that utilizes semantic features\nextracted from an RGB-D sensor. This approach enables the creation of a\nsemantic map of the environment and ensures reliable camera localization.\nSemanticSLAM is scene-agnostic, which means it doesn't require retraining for\ndifferent environments. It operates effectively in indoor settings, even with\ninfrequent camera input, without prior knowledge. The strength of SemanticSLAM\nlies in its ability to gradually refine the semantic map and improve pose\nestimation. This is achieved by a convolutional long-short-term-memory\n(ConvLSTM) network, trained to correct errors during map construction. Compared\nto existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The\nresulting semantic map provides interpretable information about the environment\nand can be easily applied to various downstream tasks, such as path planning,\nobstacle avoidance, and robot navigation. The code will be publicly available\nat https://github.com/Leomingyangli/SemanticSLAM\n","authors":["Mingyang Li","Yue Ma","Qinru Qiu"],"pdf_url":"https://arxiv.org/pdf/2401.13076v1.pdf","comment":"2023 IEEE Symposium Series on Computational Intelligence (SSCI) 6\n pages"},{"id":"http://arxiv.org/abs/2401.13068v1","updated":"2024-01-23T19:48:34Z","published":"2024-01-23T19:48:34Z","title":"Local Background Estimation for Improved Gas Plume Identification in\n Hyperspectral Images","summary":" Deep learning identification models have shown promise for identifying gas\nplumes in Longwave IR hyperspectral images of urban scenes, particularly when a\nlarge library of gases are being considered. Because many gases have similar\nspectral signatures, it is important to properly estimate the signal from a\ndetected plume. Typically, a scene's global mean spectrum and covariance matrix\nare estimated to whiten the plume's signal, which removes the background's\nsignature from the gas signature. However, urban scenes can have many different\nbackground materials that are spatially and spectrally heterogeneous. This can\nlead to poor identification performance when the global background estimate is\nnot representative of a given local background material. We use image\nsegmentation, along with an iterative background estimation algorithm, to\ncreate local estimates for the various background materials that reside\nunderneath a gas plume. Our method outperforms global background estimation on\na set of simulated and real gas plumes. 
This method shows promise in increasing\ndeep learning identification confidence, while being simple and easy to tune\nwhen considering diverse plumes.\n","authors":["Scout Jarman","Zigfried Hampel-Arias","Adra Carr","Kevin R. Moon"],"pdf_url":"https://arxiv.org/pdf/2401.13068v1.pdf","comment":"Submitted to International Geoscience and Remote Sensing Symposium\n (IGARSS), 2024. 5 pages, 2 figures"},{"id":"http://arxiv.org/abs/2401.13051v1","updated":"2024-01-23T19:20:22Z","published":"2024-01-23T19:20:22Z","title":"PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation","summary":" The Segment Anything Model (SAM) has exhibited outstanding performance in\nvarious image segmentation tasks. Despite being trained with over a billion\nmasks, SAM faces challenges in mask prediction quality in numerous scenarios,\nespecially in real-world contexts. In this paper, we introduce a novel\nprompt-driven adapter into SAM, namely Prompt Adapter Segment Anything Model\n(PA-SAM), aiming to enhance the segmentation mask quality of the original SAM.\nBy exclusively training the prompt adapter, PA-SAM extracts detailed\ninformation from images and optimizes the mask decoder feature at both sparse\nand dense prompt levels, improving the segmentation performance of SAM to\nproduce high-quality masks. Experimental results demonstrate that our PA-SAM\noutperforms other SAM-based methods in high-quality, zero-shot, and open-set\nsegmentation. We're making the source code and models available at\nhttps://github.com/xzz2/pa-sam.\n","authors":["Zhaozhi Xie","Bochen Guan","Weihao Jiang","Muyang Yi","Yue Ding","Hongtao Lu","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.13051v1.pdf","comment":"Code is available at https://github.com/xzz2/pa-sam"},{"id":"http://arxiv.org/abs/2401.13049v1","updated":"2024-01-23T19:17:20Z","published":"2024-01-23T19:17:20Z","title":"CIS-UNet: Multi-Class Segmentation of the Aorta in Computed Tomography\n Angiography via Context-Aware Shifted Window Self-Attention","summary":" Advancements in medical imaging and endovascular grafting have facilitated\nminimally invasive treatments for aortic diseases. Accurate 3D segmentation of\nthe aorta and its branches is crucial for interventions, as inaccurate\nsegmentation can lead to erroneous surgical planning and endograft\nconstruction. Previous methods simplified aortic segmentation as a binary image\nsegmentation problem, overlooking the necessity of distinguishing between\nindividual aortic branches. In this paper, we introduce Context Infused\nSwin-UNet (CIS-UNet), a deep learning model designed for multi-class\nsegmentation of the aorta and thirteen aortic branches. Combining the strengths\nof Convolutional Neural Networks (CNNs) and Swin transformers, CIS-UNet adopts\na hierarchical encoder-decoder structure comprising a CNN encoder, symmetric\ndecoder, skip connections, and a novel Context-aware Shifted Window\nSelf-Attention (CSW-SA) as the bottleneck block. Notably, CSW-SA introduces a\nunique utilization of the patch merging layer, distinct from conventional Swin\ntransformers. It efficiently condenses the feature map, providing a global\nspatial context and enhancing performance when applied at the bottleneck layer,\noffering superior computational efficiency and segmentation accuracy compared\nto the Swin transformers. We trained our model on computed tomography (CT)\nscans from 44 patients and tested it on 15 patients. 
CIS-UNet outperformed the\nstate-of-the-art SwinUNetR segmentation model, which is solely based on Swin\ntransformers, by achieving a superior mean Dice coefficient of 0.713 compared\nto 0.697, and a mean surface distance of 2.78 mm compared to 3.39 mm.\nCIS-UNet's superior 3D aortic segmentation offers improved precision and\noptimization for planning endovascular treatments. Our dataset and code will be\npublicly available.\n","authors":["Muhammad Imran","Jonathan R Krebs","Veera Rajasekhar Reddy Gopu","Brian Fazzone","Vishal Balaji Sivaraman","Amarjeet Kumar","Chelsea Viscardi","Robert Evans Heithaus","Benjamin Shickel","Yuyin Zhou","Michol A Cooper","Wei Shao"],"pdf_url":"https://arxiv.org/pdf/2401.13049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/1811.08075v2","updated":"2024-01-23T19:16:31Z","published":"2018-11-20T04:55:07Z","title":"Scene Graph Generation via Conditional Random Fields","summary":" Despite the great success object detection and segmentation models have\nachieved in recognizing individual objects in images, performance on cognitive\ntasks such as image caption, semantic image retrieval, and visual QA is far\nfrom satisfactory. To achieve better performance on these cognitive tasks,\nmerely recognizing individual object instances is insufficient. Instead, the\ninteractions between object instances need to be captured in order to\nfacilitate reasoning and understanding of the visual scenes in an image. Scene\ngraph, a graph representation of images that captures object instances and\ntheir relationships, offers a comprehensive understanding of an image. However,\nexisting techniques on scene graph generation fail to distinguish subjects and\nobjects in the visual scenes of images and thus do not perform well with\nreal-world datasets where exist ambiguous object instances. In this work, we\npropose a novel scene graph generation model for predicting object instances\nand its corresponding relationships in an image. Our model, SG-CRF, learns the\nsequential order of subject and object in a relationship triplet, and the\nsemantic compatibility of object instance nodes and relationship nodes in a\nscene graph efficiently. Experiments empirically show that SG-CRF outperforms\nthe state-of-the-art methods, on three different datasets, i.e., CLEVR, VRD,\nand Visual Genome, raising the Recall@100 from 24.99% to 49.95%, from 41.92% to\n50.47%, and from 54.69% to 54.77%, respectively.\n","authors":["Weilin Cong","William Wang","Wang-Chien Lee"],"pdf_url":"https://arxiv.org/pdf/1811.08075v2.pdf","comment":"Need to withdraw this draft as requested by collaborators"},{"id":"http://arxiv.org/abs/2401.13011v1","updated":"2024-01-23T11:46:28Z","published":"2024-01-23T11:46:28Z","title":"CCA: Collaborative Competitive Agents for Image Editing","summary":" This paper presents a novel generative model, Collaborative Competitive\nAgents (CCA), which leverages the capabilities of multiple Large Language\nModels (LLMs) based agents to execute complex tasks. Drawing inspiration from\nGenerative Adversarial Networks (GANs), the CCA system employs two equal-status\ngenerator agents and a discriminator agent. The generators independently\nprocess user instructions and generate results, while the discriminator\nevaluates the outputs, and provides feedback for the generator agents to\nfurther reflect and improve the generation results. 
Unlike the previous\ngenerative model, our system can obtain the intermediate steps of generation.\nThis allows each generator agent to learn from other successful executions due\nto its transparency, enabling a collaborative competition that enhances the\nquality and robustness of the system's results. The primary focus of this study\nis image editing, demonstrating the CCA's ability to handle intricate\ninstructions robustly. The paper's main contributions include the introduction\nof a multi-agent-based generative model with controllable intermediate steps\nand iterative optimization, a detailed examination of agent relationships, and\ncomprehensive experiments on image editing. Code is available at\n\\href{https://github.com/TiankaiHang/CCA}{https://github.com/TiankaiHang/CCA}.\n","authors":["Tiankai Hang","Shuyang Gu","Dong Chen","Xin Geng","Baining Guo"],"pdf_url":"https://arxiv.org/pdf/2401.13011v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2401.12798v1","updated":"2024-01-23T14:31:12Z","published":"2024-01-23T14:31:12Z","title":"Gradient Flow of Energy: A General and Efficient Approach for Entity\n Alignment Decoding","summary":" Entity alignment (EA), a pivotal process in integrating multi-source\nKnowledge Graphs (KGs), seeks to identify equivalent entity pairs across these\ngraphs. Most existing approaches regard EA as a graph representation learning\ntask, concentrating on enhancing graph encoders. However, the decoding process\nin EA - essential for effective operation and alignment accuracy - has received\nlimited attention and remains tailored to specific datasets and model\narchitectures, necessitating both entity and additional explicit relation\nembeddings. This specificity limits its applicability, particularly in\nGNN-based models. To address this gap, we introduce a novel, generalized, and\nefficient decoding approach for EA, relying solely on entity embeddings. Our\nmethod optimizes the decoding process by minimizing Dirichlet energy, leading\nto the gradient flow within the graph, to promote graph homophily. The\ndiscretization of the gradient flow produces a fast and scalable approach,\ntermed Triple Feature Propagation (TFP). TFP innovatively channels gradient\nflow through three views: entity-to-entity, entity-to-relation, and\nrelation-to-entity. This generalized gradient flow enables TFP to harness the\nmulti-view structural information of KGs. Rigorous experimentation on diverse\nreal-world datasets demonstrates that our approach significantly enhances\nvarious EA methods. Notably, the approach achieves these advancements with less\nthan 6 seconds of additional computational time, establishing a new benchmark\nin efficiency and adaptability for future EA methods.\n","authors":["Yuanyi Wang","Haifeng Sun","Jingyu Wang","Qi Qi","Shaoling Sun","Jianxin Liao"],"pdf_url":"https://arxiv.org/pdf/2401.12798v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.16716v4","updated":"2024-01-23T14:05:58Z","published":"2023-11-28T12:00:06Z","title":"GraphPro: Graph Pre-training and Prompt Learning for Recommendation","summary":" GNN-based recommenders have excelled in modeling intricate user-item\ninteractions through multi-hop message passing. However, existing methods often\noverlook the dynamic nature of evolving user-item interactions, which impedes\nthe adaption to changing user preferences and distribution shifts in newly\narriving data. Thus, their scalability and performances in real-world dynamic\nenvironments are limited. 
In this study, we propose GraphPro, a framework that\nincorporates parameter-efficient and dynamic graph pre-training with prompt\nlearning. This novel combination empowers GNNs to effectively capture both\nlong-term user preferences and short-term behavior dynamics, enabling the\ndelivery of accurate and timely recommendations. Our GraphPro framework\naddresses the challenge of evolving user preferences by seamlessly integrating\na temporal prompt mechanism and a graph-structural prompt learning mechanism\ninto the pre-trained GNN model. The temporal prompt mechanism encodes time\ninformation on user-item interaction, allowing the model to naturally capture\ntemporal context, while the graph-structural prompt learning mechanism enables\nthe transfer of pre-trained knowledge to adapt to behavior dynamics without the\nneed for continuous incremental training. We further bring in a dynamic\nevaluation setting for recommendation to mimic real-world dynamic scenarios and\nbridge the offline-online gap to a better level. Our extensive experiments\nincluding a large-scale industrial deployment showcases the lightweight plug-in\nscalability of our GraphPro when integrated with various state-of-the-art\nrecommenders, emphasizing the advantages of GraphPro in terms of effectiveness,\nrobustness and efficiency.\n","authors":["Yuhao Yang","Lianghao Xia","Da Luo","Kangyi Lin","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2311.16716v4.pdf","comment":"Accepted by WWW'2024, full paper"},{"id":"http://arxiv.org/abs/2310.14037v2","updated":"2024-01-23T13:40:30Z","published":"2023-10-21T15:21:39Z","title":"Unlock Multi-Modal Capability of Dense Retrieval via Visual Module\n Plugin","summary":" This paper proposes Multi-modAl Retrieval model via Visual modulE pLugin\n(MARVEL) to learn an embedding space for queries and multi-modal documents to\nconduct retrieval. MARVEL encodes queries and multi-modal documents with a\nunified encoder model, which helps to alleviate the modality gap between images\nand texts. Specifically, we enable the image understanding ability of a\nwell-trained dense retriever, T5-ANCE, by incorporating the image features\nencoded by the visual module as its inputs. To facilitate the multi-modal\nretrieval tasks, we build the ClueWeb22-MM dataset based on the ClueWeb22\ndataset, which regards anchor texts as queries, and exact the related texts and\nimage documents from anchor linked web pages. Our experiments show that MARVEL\nsignificantly outperforms the state-of-the-art methods on the multi-modal\nretrieval dataset WebQA and ClueWeb22-MM. Our further analyses show that the\nvisual module plugin method is tailored to enable the image understanding\nability for an existing dense retrieval model. Besides, we also show that the\nlanguage model has the ability to extract image semantics from image encoders\nand adapt the image features in the input space of language models. 
All codes\nare available at https://github.com/OpenMatch/MARVEL.\n","authors":["Tianshuo Zhou","Sen Mei","Xinze Li","Zhenghao Liu","Chenyan Xiong","Zhiyuan Liu","Yu Gu","Ge Yu"],"pdf_url":"https://arxiv.org/pdf/2310.14037v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12732v1","updated":"2024-01-23T13:06:19Z","published":"2024-01-23T13:06:19Z","title":"CDRNP: Cross-Domain Recommendation to Cold-Start Users via Neural\n Process","summary":" Cross-domain recommendation (CDR) has been proven as a promising way to\ntackle the user cold-start problem, which aims to make recommendations for\nusers in the target domain by transferring the user preference derived from the\nsource domain. Traditional CDR studies follow the embedding and mapping (EMCDR)\nparadigm, which transfers user representations from the source to target domain\nby learning a user-shared mapping function, neglecting the user-specific\npreference. Recent CDR studies attempt to learn user-specific mapping functions\nin meta-learning paradigm, which regards each user's CDR as an individual task,\nbut neglects the preference correlations among users, limiting the beneficial\ninformation for user representations. Moreover, both of the paradigms neglect\nthe explicit user-item interactions from both domains during the mapping\nprocess. To address the above issues, this paper proposes a novel CDR framework\nwith neural process (NP), termed as CDRNP. Particularly, it develops the\nmeta-learning paradigm to leverage user-specific preference, and further\nintroduces a stochastic process by NP to capture the preference correlations\namong the overlapping and cold-start users, thus generating more powerful\nmapping functions by mapping the user-specific preference and common preference\ncorrelations to a predictive probability distribution. In addition, we also\nintroduce a preference remainer to enhance the common preference from the\noverlapping users, and finally devises an adaptive conditional decoder with\npreference modulation to make prediction for cold-start users with items in the\ntarget domain. Experimental results demonstrate that CDRNP outperforms previous\nSOTA methods in three real-world CDR scenarios.\n","authors":["Xiaodong Li","Jiawei Sheng","Jiangxia Cao","Wenyuan Zhang","Quangang Li","Tingwen Liu"],"pdf_url":"https://arxiv.org/pdf/2401.12732v1.pdf","comment":"This paper is accepted by WSDM'2024 Oral"},{"id":"http://arxiv.org/abs/2401.12593v1","updated":"2024-01-23T09:48:08Z","published":"2024-01-23T09:48:08Z","title":"MOReGIn: Multi-Objective Recommendation at the Global and Individual\n Levels","summary":" Multi-Objective Recommender Systems (MORSs) emerged as a paradigm to\nguarantee multiple (often conflicting) goals. Besides accuracy, a MORS can\noperate at the global level, where additional beyond-accuracy goals are met for\nthe system as a whole, or at the individual level, meaning that the\nrecommendations are tailored to the needs of each user. The state-of-the-art\nMORSs either operate at the global or individual level, without assuming the\nco-existence of the two perspectives. In this study, we show that when global\nand individual objectives co-exist, MORSs are not able to meet both types of\ngoals. To overcome this issue, we present an approach that regulates the\nrecommendation lists so as to guarantee both global and individual\nperspectives, while preserving its effectiveness. Specifically, as individual\nperspective, we tackle genre calibration and, as global perspective, provider\nfairness. 
We validate our approach on two real-world datasets, publicly\nreleased with this paper.\n","authors":["Elizabeth Gómez","David Contreras","Ludovico Boratto","Maria Salamó"],"pdf_url":"https://arxiv.org/pdf/2401.12593v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12590v1","updated":"2024-01-23T09:45:49Z","published":"2024-01-23T09:45:49Z","title":"PolyCF: Towards the Optimal Spectral Graph Filters for Collaborative\n Filtering","summary":" Collaborative Filtering (CF) is a pivotal research area in recommender\nsystems that capitalizes on collaborative similarities between users and items\nto provide personalized recommendations. With the remarkable achievements of\nnode embedding-based Graph Neural Networks (GNNs), we explore the upper bounds\nof expressiveness inherent to embedding-based methodologies and tackle the\nchallenges by reframing the CF task as a graph signal processing problem. To\nthis end, we propose PolyCF, a flexible graph signal filter that leverages\npolynomial graph filters to process interaction signals. PolyCF exhibits the\ncapability to capture spectral features across multiple eigenspaces through a\nseries of Generalized Gram filters and is able to approximate the optimal\npolynomial response function for recovering missing interactions. A graph\noptimization objective and a pair-wise ranking objective are jointly used to\noptimize the parameters of the convolution kernel. Experiments on three widely\nadopted datasets demonstrate the superiority of PolyCF over current\nstate-of-the-art CF methods. Moreover, comprehensive studies empirically\nvalidate each component's efficacy in the proposed PolyCF.\n","authors":["Yifang Qin","Wei Ju","Xiao Luo","Yiyang Gu","Ming Zhang"],"pdf_url":"https://arxiv.org/pdf/2401.12590v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12553v1","updated":"2024-01-23T08:24:44Z","published":"2024-01-23T08:24:44Z","title":"InfoRank: Unbiased Learning-to-Rank via Conditional Mutual Information\n Minimization","summary":" Ranking items regarding individual user interests is a core technique of\nmultiple downstream tasks such as recommender systems. Learning such a\npersonalized ranker typically relies on the implicit feedback from users' past\nclick-through behaviors. However, collected feedback is biased toward\npreviously highly-ranked items and directly learning from it would result in a\n\"rich-get-richer\" phenomenon. In this paper, we propose a simple yet sufficient\nunbiased learning-to-rank paradigm named InfoRank that aims to simultaneously\naddress both position and popularity biases. We begin by consolidating the\nimpacts of those biases into a single observation factor, thereby providing a\nunified approach to addressing bias-related issues. Subsequently, we minimize\nthe mutual information between the observation estimation and the relevance\nestimation conditioned on the input features. By doing so, our relevance\nestimation can be proved to be free of bias. To implement InfoRank, we first\nincorporate an attention mechanism to capture latent correlations within\nuser-item features, thereby generating estimations of observation and\nrelevance. We then introduce a regularization term, grounded in conditional\nmutual information, to promote conditional independence between relevance\nestimation and observation estimation. 
Experimental evaluations conducted\nacross three extensive recommendation and search datasets reveal that InfoRank\nlearns more precise and unbiased ranking strategies.\n","authors":["Jiarui Jin","Zexue He","Mengyue Yang","Weinan Zhang","Yong Yu","Jun Wang","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2401.12553v1.pdf","comment":"WWW 2024"},{"id":"http://arxiv.org/abs/2212.12970v3","updated":"2024-01-23T07:57:55Z","published":"2022-12-25T23:19:56Z","title":"Refined Edge Usage of Graph Neural Networks for Edge Prediction","summary":" Graph Neural Networks (GNNs), originally proposed for node classification,\nhave also motivated many recent works on edge prediction (a.k.a., link\nprediction). However, existing methods lack elaborate design regarding the\ndistinctions between two tasks that have been frequently overlooked: (i) edges\nonly constitute the topology in the node classification task but can be used as\nboth the topology and the supervisions (i.e., labels) in the edge prediction\ntask; (ii) the node classification makes prediction over each individual node,\nwhile the edge prediction is determinated by each pair of nodes. To this end,\nwe propose a novel edge prediction paradigm named Edge-aware Message PassIng\nneuRal nEtworks (EMPIRE). Concretely, we first introduce an edge splitting\ntechnique to specify use of each edge where each edge is solely used as either\nthe topology or the supervision (named as topology edge or supervision edge).\nWe then develop a new message passing mechanism that generates the messages to\nsource nodes (through topology edges) being aware of target nodes (through\nsupervision edges). In order to emphasize the differences between pairs\nconnected by supervision edges and pairs unconnected, we further weight the\nmessages to highlight the relative ones that can reflect the differences. In\naddition, we design a novel negative node-pair sampling trick that efficiently\nsamples 'hard' negative instances in the supervision instances, and can\nsignificantly improve the performance. Experimental results verify that the\nproposed method can significantly outperform existing state-of-the-art models\nregarding the edge prediction task on multiple homogeneous and heterogeneous\ngraph datasets.\n","authors":["Jiarui Jin","Yangkun Wang","Weinan Zhang","Quan Gan","Xiang Song","Yong Yu","Zheng Zhang","David Wipf"],"pdf_url":"https://arxiv.org/pdf/2212.12970v3.pdf","comment":"Need major revisions"},{"id":"http://arxiv.org/abs/2310.03025v2","updated":"2024-01-23T07:49:13Z","published":"2023-10-04T17:59:41Z","title":"Retrieval meets Long Context Large Language Models","summary":" Extending the context window of large language models (LLMs) is getting\npopular recently, while the solution of augmenting LLMs with retrieval has\nexisted for years. The natural questions are: i) Retrieval-augmentation versus\nlong context window, which one is better for downstream tasks? ii) Can both\nmethods be combined to get the best of both worlds? In this work, we answer\nthese questions by studying both solutions using two state-of-the-art\npretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps\nsurprisingly, we find that LLM with 4K context window using simple\nretrieval-augmentation at generation can achieve comparable performance to\nfinetuned LLM with 16K context window via positional interpolation on long\ncontext tasks, while taking much less computation. 
More importantly, we\ndemonstrate that retrieval can significantly improve the performance of LLMs\nregardless of their extended context window sizes. Our best model,\nretrieval-augmented Llama2-70B with 32K context window, outperforms\nGPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context\ntasks including question answering, query-based summarization, and in-context\nfew-shot learning tasks. It also outperforms its non-retrieval Llama2-70B-32k\nbaseline by a margin, while being much faster at generation. Our study provides\ngeneral insights on the choice of retrieval-augmentation versus long context\nextension of LLM for practitioners.\n","authors":["Peng Xu","Wei Ping","Xianchao Wu","Lawrence McAfee","Chen Zhu","Zihan Liu","Sandeep Subramanian","Evelina Bakhturina","Mohammad Shoeybi","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2310.03025v2.pdf","comment":"Published at ICLR 2024"},{"id":"http://arxiv.org/abs/2401.12540v1","updated":"2024-01-23T07:48:58Z","published":"2024-01-23T07:48:58Z","title":"DREditor: An Time-efficient Approach for Building a Domain-specific\n Dense Retrieval Model","summary":" Deploying dense retrieval models efficiently is becoming increasingly\nimportant across various industries. This is especially true for enterprise\nsearch services, where customizing search engines to meet the time demands of\ndifferent enterprises in different domains is crucial. Motivated by this, we\ndevelop a time-efficient approach called DREditor to edit the matching rule of\nan off-the-shelf dense retrieval model to suit a specific domain. This is\nachieved by directly calibrating the output embeddings of the model using an\nefficient and effective linear mapping. This mapping is powered by an edit\noperator that is obtained by solving a specially constructed least squares\nproblem. Compared to implicit rule modification via long-time finetuning, our\nexperimental results show that DREditor provides significant advantages on\ndifferent domain-specific datasets, dataset sources, retrieval models, and\ncomputing devices. It consistently enhances time efficiency by 100-300 times\nwhile maintaining comparable or even superior retrieval performance. In a\nbroader context, we take the first step to introduce a novel embedding\ncalibration approach for the retrieval task, filling the technical blank in the\ncurrent field of embedding calibration. This approach also paves the way for\nbuilding domain-specific dense retrieval models efficiently and inexpensively.\n","authors":["Chen Huang","Duanyu Feng","Wenqiang Lei","Jiancheng Lv"],"pdf_url":"https://arxiv.org/pdf/2401.12540v1.pdf","comment":"15 pages, 6 figures, Codes are available at\n https://github.com/huangzichun/DREditor"},{"id":"http://arxiv.org/abs/2401.12520v1","updated":"2024-01-23T06:30:05Z","published":"2024-01-23T06:30:05Z","title":"Key Information Retrieval to Classify the Unstructured Data Content of\n Preferential Trade Agreements","summary":" With the rapid proliferation of textual data, predicting long texts has\nemerged as a significant challenge in the domain of natural language\nprocessing. Traditional text prediction methods encounter substantial\ndifficulties when grappling with long texts, primarily due to the presence of\nredundant and irrelevant information, which impedes the model's capacity to\ncapture pivotal insights from the text. To address this issue, we introduce a\nnovel approach to long-text classification and prediction. 
Initially, we employ\nembedding techniques to condense the long texts, aiming to diminish the\nredundancy therein. Subsequently,the Bidirectional Encoder Representations from\nTransformers (BERT) embedding method is utilized for text classification\ntraining. Experimental outcomes indicate that our method realizes considerable\nperformance enhancements in classifying long texts of Preferential Trade\nAgreements. Furthermore, the condensation of text through embedding methods not\nonly augments prediction accuracy but also substantially reduces computational\ncomplexity. Overall, this paper presents a strategy for long-text prediction,\noffering a valuable reference for researchers and engineers in the natural\nlanguage processing sphere.\n","authors":["Jiahui Zhao","Ziyi Meng","Stepan Gordeev","Zijie Pan","Dongjin Song","Sandro Steinbach","Caiwen Ding"],"pdf_url":"https://arxiv.org/pdf/2401.12520v1.pdf","comment":"AI4TS Workshop@AAAI 2024 accepted publication"},{"id":"http://arxiv.org/abs/2401.10225v2","updated":"2024-01-23T05:04:32Z","published":"2024-01-18T18:59:11Z","title":"ChatQA: Building GPT-4 Level Conversational QA Models","summary":" In this work, we introduce ChatQA, a family of conversational question\nanswering (QA) models that obtain GPT-4 level accuracies. Specifically, we\npropose a two-stage instruction tuning method that can significantly improve\nthe zero-shot conversational QA results from large language models (LLMs). To\nhandle retrieval-augmented generation in conversational QA, we fine-tune a\ndense retriever on a multi-turn QA dataset, which provides comparable results\nto using the state-of-the-art query rewriting model while largely reducing\ndeployment cost. Notably, our ChatQA-70B can outperform GPT-4 in terms of\naverage score on 10 conversational QA datasets (54.14 vs. 53.90), without\nrelying on any synthetic data from OpenAI GPT models.\n","authors":["Zihan Liu","Wei Ping","Rajarshi Roy","Peng Xu","Chankyu Lee","Mohammad Shoeybi","Bryan Catanzaro"],"pdf_url":"https://arxiv.org/pdf/2401.10225v2.pdf","comment":"We added ChatQA-22B results"},{"id":"http://arxiv.org/abs/2309.09085v3","updated":"2024-01-23T05:02:45Z","published":"2023-09-16T19:40:30Z","title":"SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription","summary":" Guitar tablature is a form of music notation widely used among guitarists. It\ncaptures not only the musical content of a piece, but also its implementation\nand ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an\nimportant task with broad applications in music education, composition, and\nentertainment. Existing GTT datasets are quite limited in size and scope,\nrendering models trained on them prone to overfitting and incapable of\ngeneralizing to out-of-domain data. In order to address this issue, we present\na methodology for synthesizing large-scale GTT audio using commercial acoustic\nand electric guitar plugins. We procure SynthTab, a dataset derived from\nDadaGP, which is a vast and diverse collection of richly annotated symbolic\ntablature. The proposed synthesis pipeline produces audio which faithfully\nadheres to the original fingerings and a subset of techniques specified in the\ntablature, and covers multiple guitars and styles for each track. Experiments\nshow that pre-training a baseline GTT model on SynthTab can improve\ntranscription performance when fine-tuning and testing on an individual\ndataset. 
More importantly, cross-dataset experiments show that pre-training\nsignificantly mitigates issues with overfitting.\n","authors":["Yongyi Zang","Yi Zhong","Frank Cwitkowitz","Zhiyao Duan"],"pdf_url":"https://arxiv.org/pdf/2309.09085v3.pdf","comment":"Accepted to ICASSP 2024"},{"id":"http://arxiv.org/abs/2401.12483v1","updated":"2024-01-23T04:32:32Z","published":"2024-01-23T04:32:32Z","title":"Persona-centric Metamorphic Relation guided Robustness Evaluation for\n Multi-turn Dialogue Modelling","summary":" Recently there has been significant progress in the field of dialogue system\nthanks to the introduction of training paradigms such as fine-tune and prompt\nlearning. Persona can function as the prior knowledge for maintaining the\npersonality consistency of dialogue systems, which makes it perform well on\naccuracy. Nonetheless, the conventional reference-based evaluation method falls\nshort in capturing the genuine text comprehension prowess of the model,\nsignificantly relying on the quality of data annotation. In contrast, the\napplication of metamorphic testing offers a more profound insight into the\nmodel's distinct capabilities without necessitating supplementary annotation\nlabels. This approach furnishes a more comprehensive portrayal of the model's\nintricacies and exposes intricacies concealed within reference-based validation\ntechniques. Consequently, we introduce a persona-centric metamorphic relation\nconstruction for metamorphic testing, aimed at evaluating both the persona\nconsistency and robustness of personalized dialogue models. For that reason,\nthis work evaluates several widely used training paradigms including learning\nfrom scratch, pretrain + fine-tune and prompt learning in personalized dialogue\nretrieval to know if they are more robust or if they have the same flaws as\ntheir predecessor. Under three kinds of designed metamorphic relations with\nconsistent outputs, our experimental results reveal that prompt learning shows\nstronger robustness compared to training from scratch and fine-tune. Although\ntested retrieval models gain competitively high retrieval accuracy according to\nthe traditional reference-based validation, they are still fragile and\ndemonstrate various unexpected behaviors, thus there is still room for future\nimprovement in personalized dialogue retrieval.\n","authors":["Yanbing Chen","Lin Li","Xiaohui Tao","Dong Zhou"],"pdf_url":"https://arxiv.org/pdf/2401.12483v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11624v2","updated":"2024-01-23T03:35:40Z","published":"2024-01-21T23:34:42Z","title":"In-context Learning with Retrieved Demonstrations for Language Models: A\n Survey","summary":" Language models, especially pre-trained large language models, have showcased\nremarkable abilities as few-shot in-context learners (ICL), adept at adapting\nto new tasks with just a few demonstrations in the input context. However, the\nmodel's ability to perform ICL is sensitive to the choice of the few-shot\ndemonstrations. Instead of using a fixed set of demonstrations, one recent\ndevelopment is to retrieve demonstrations tailored to each input query. The\nimplementation of demonstration retrieval is relatively straightforward,\nleveraging existing databases and retrieval systems. This not only improves the\nefficiency and scalability of the learning process but also has been shown to\nreduce biases inherent in manual example selection. 
In light of the encouraging\nresults and growing research in ICL with retrieved demonstrations, we conduct\nan extensive review of studies in this area. In this survey, we discuss and\ncompare different design choices for retrieval models, retrieval training\nprocedures, and inference algorithms.\n","authors":["Man Luo","Xin Xu","Yue Liu","Panupong Pasupat","Mehran Kazemi"],"pdf_url":"https://arxiv.org/pdf/2401.11624v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12445v1","updated":"2024-01-23T02:24:17Z","published":"2024-01-23T02:24:17Z","title":"Session-level Normalization and Click-through Data Enhancement for\n Session-based Evaluation","summary":" Since a user usually has to issue a sequence of queries and examine multiple\ndocuments to resolve a complex information need in a search session,\nresearchers have paid much attention to evaluating search systems at the\nsession level rather than the single-query level. Most existing session-level\nmetrics evaluate each query separately and then aggregate the query-level\nscores using a session-level weighting function. The assumptions behind these\nmetrics are that all queries in the session should be involved, and their\norders are fixed. However, if a search system could make the user satisfied\nwith her first few queries, she may not need any subsequent queries. Besides,\nin most real-world search scenarios, due to a lack of explicit feedback from\nreal users, we can only leverage some implicit feedback, such as users' clicks,\nas relevance labels for offline evaluation. Such implicit feedback might be\ndifferent from the real relevance in a search session as some documents may be\nomitted in the previous query but identified in the later reformulations. To\naddress the above issues, we make two assumptions about session-based\nevaluation, which explicitly describe an ideal session-search system and how to\nenhance click-through data in computing session-level evaluation metrics. Based\non our assumptions, we design a session-level metric called Normalized\nU-Measure (NUM). NUM evaluates a session as a whole and utilizes an ideal\nsession to normalize the result of the actual session. Besides, it infers\nsession-level relevance labels based on implicit feedback. Experiments on two\npublic datasets demonstrate the effectiveness of NUM by comparing it with\nexisting session-based metrics in terms of correlation with user satisfaction\nand intuitiveness. We also conduct ablation studies to explore whether these\nassumptions hold.\n","authors":["Haonan Chen","Zhicheng Dou","Jiaxin Mao"],"pdf_url":"https://arxiv.org/pdf/2401.12445v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.11478v2","updated":"2024-01-23T02:22:51Z","published":"2024-01-21T12:51:28Z","title":"D2K: Turning Historical Data into Retrievable Knowledge for Recommender\n Systems","summary":" A vast amount of user behavior data is constantly accumulating on today's\nlarge recommendation platforms, recording users' various interests and tastes.\nPreserving knowledge from the old data while new data continually arrives is a\nvital problem for recommender systems. Existing approaches generally seek to\nsave the knowledge implicitly in the model parameters. However, such a\nparameter-centric approach lacks scalability and flexibility -- the capacity is\nhard to scale, and the knowledge is inflexible to utilize. Hence, in this work,\nwe propose a framework that turns massive user behavior data to retrievable\nknowledge (D2K). 
It is a data-centric approach that is model-agnostic and easy\nto scale up. Different from only storing unary knowledge such as the user-side\nor item-side information, D2K propose to store ternary knowledge for\nrecommendation, which is determined by the complete recommendation factors --\nuser, item, and context. The knowledge retrieved by target samples can be\ndirectly used to enhance the performance of any recommendation algorithms.\nSpecifically, we introduce a Transformer-based knowledge encoder to transform\nthe old data into knowledge with the user-item-context cross features. A\npersonalized knowledge adaptation unit is devised to effectively exploit the\ninformation from the knowledge base by adapting the retrieved knowledge to the\ntarget samples. Extensive experiments on two public datasets show that D2K\nsignificantly outperforms existing baselines and is compatible with a major\ncollection of recommendation algorithms.\n","authors":["Jiarui Qin","Weiwen Liu","Ruiming Tang","Weinan Zhang","Yong Yu"],"pdf_url":"https://arxiv.org/pdf/2401.11478v2.pdf","comment":"12 pages, 7 figures"},{"id":"http://arxiv.org/abs/2309.09477v2","updated":"2024-01-23T00:46:07Z","published":"2023-09-18T04:17:44Z","title":"How Much Freedom Does An Effectiveness Metric Really Have?","summary":" It is tempting to assume that because effectiveness metrics have free choice\nto assign scores to search engine result pages (SERPs) there must thus be a\nsimilar degree of freedom as to the relative order that SERP pairs can be put\ninto. In fact that second freedom is, to a considerable degree, illusory.\nThat's because if one SERP in a pair has been given a certain score by a\nmetric, fundamental ordering constraints in many cases then dictate that the\nscore for the second SERP must be either not less than, or not greater than,\nthe score assigned to the first SERP. We refer to these fixed relationships as\ninnate pairwise SERP orderings. Our first goal in this work is to describe and\ndefend those pairwise SERP relationship constraints, and tabulate their\nrelative occurrence via both exhaustive and empirical experimentation.\n We then consider how to employ such innate pairwise relationships in IR\nexperiments, leading to a proposal for a new measurement paradigm.\nSpecifically, we argue that tables of results in which many different metrics\nare listed for champion versus challenger system comparisons should be avoided;\nand that instead a single metric be argued for in principled terms, with any\nrelationships identified by that metric then reinforced via an assessment of\nthe innate relationship as to whether other metrics - indeed, all other metrics\n- are likely to yield the same system-vs-system outcome.\n","authors":["Alistair Moffat","Joel Mackenzie"],"pdf_url":"https://arxiv.org/pdf/2309.09477v2.pdf","comment":"To Appear: Journal of the Association for Information Science and\n Technology, 2024"},{"id":"http://arxiv.org/abs/2401.10841v2","updated":"2024-01-23T20:05:30Z","published":"2024-01-19T17:40:50Z","title":"Using LLMs to discover emerging coded antisemitic hate-speech in\n extremist social media","summary":" Online hate speech proliferation has created a difficult problem for social\nmedia platforms. A particular challenge relates to the use of coded language by\ngroups interested in both creating a sense of belonging for its users and\nevading detection. 
Coded language evolves quickly and its use varies over time.\nThis paper proposes a methodology for detecting emerging coded hate-laden\nterminology. The methodology is tested in the context of online antisemitic\ndiscourse. The approach considers posts scraped from social media platforms,\noften used by extremist users. The posts are scraped using seed expressions\nrelated to previously known discourse of hatred towards Jews. The method begins\nby identifying the expressions most representative of each post and calculating\ntheir frequency in the whole corpus. It filters out grammatically incoherent\nexpressions as well as previously encountered ones so as to focus on emergent\nwell-formed terminology. This is followed by an assessment of semantic\nsimilarity to known antisemitic terminology using a fine-tuned large language\nmodel, and subsequent filtering out of the expressions that are too distant\nfrom known expressions of hatred. Emergent antisemitic expressions containing\nterms clearly relating to Jewish topics are then removed to return only coded\nexpressions of hatred.\n","authors":["Dhanush Kikkisetti","Raza Ul Mustafa","Wendy Melillo","Roberto Corizzo","Zois Boukouvalas","Jeff Gill","Nathalie Japkowicz"],"pdf_url":"https://arxiv.org/pdf/2401.10841v2.pdf","comment":"9 pages, 4 figures, 2 algorithms, 3 tables"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2401.04079v2","updated":"2024-01-23T18:59:52Z","published":"2024-01-08T18:31:38Z","title":"RudolfV: A Foundation Model by Pathologists for Pathologists","summary":" Histopathology plays a central role in clinical medicine and biomedical\nresearch. While artificial intelligence shows promising results on many\npathological tasks, generalization and dealing with rare diseases, where\ntraining data is scarce, remains a challenge. Distilling knowledge from\nunlabeled data into a foundation model before learning from, potentially\nlimited, labeled data provides a viable path to address these challenges. In\nthis work, we extend the state of the art of foundation models for digital\npathology whole slide images by semi-automated data curation and incorporating\npathologist domain knowledge. Specifically, we combine computational and\npathologist domain knowledge (1) to curate a diverse dataset of 103k slides\ncorresponding to 750 million image patches covering data from different\nfixation, staining, and scanning protocols as well as data from different\nindications and labs across the EU and US, (2) for grouping semantically\nsimilar slides and tissue patches, and (3) to augment the input images during\ntraining. 
We evaluate the resulting model on a set of public and internal\nbenchmarks and show that although our foundation model is trained with an order\nof magnitude fewer slides, it performs on par with or better than competing models.\nWe expect that scaling our approach to more data and larger models will further\nincrease its performance and capacity to deal with increasingly complex real-world\ntasks in diagnostics and biomedical research.\n","authors":["Jonas Dippel","Barbara Feulner","Tobias Winterhoff","Simon Schallenberg","Gabriel Dernbach","Andreas Kunft","Stephan Tietz","Philipp Jurmeister","David Horst","Lukas Ruff","Klaus-Robert Müller","Frederick Klauschen","Maximilian Alber"],"pdf_url":"https://arxiv.org/pdf/2401.04079v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12433v2","updated":"2024-01-23T18:59:39Z","published":"2023-12-19T18:58:40Z","title":"Tracking Any Object Amodally","summary":" Amodal perception, the ability to comprehend complete object structures from\npartial visibility, is a fundamental skill, even for infants. Its significance\nextends to applications like autonomous driving, where a clear understanding of\nheavily occluded objects is essential. However, modern detection and tracking\nalgorithms often overlook this critical capability, perhaps due to the\nprevalence of modal annotations in most datasets. To address the scarcity of\namodal data, we introduce the TAO-Amodal benchmark, featuring 880 diverse\ncategories in thousands of video sequences. Our dataset includes amodal and\nmodal bounding boxes for visible and occluded objects, including objects that\nare partially out-of-frame. To enhance amodal tracking with object permanence,\nwe leverage a lightweight plug-in module, the amodal expander, to transform\nstandard, modal trackers into amodal ones through fine-tuning on a few hundred\nvideo sequences with data augmentation. We achieve a 3.3\% and 1.6\%\nimprovement on the detection and tracking of occluded objects on TAO-Amodal.\nWhen evaluated on people, our method produces dramatic improvements of 2x\ncompared to state-of-the-art modal baselines.\n","authors":["Cheng-Yen Hsieh","Tarasha Khurana","Achal Dave","Deva Ramanan"],"pdf_url":"https://arxiv.org/pdf/2312.12433v2.pdf","comment":"Project Page: https://tao-amodal.github.io"},{"id":"http://arxiv.org/abs/2401.12973v1","updated":"2024-01-23T18:59:21Z","published":"2024-01-23T18:59:21Z","title":"In-Context Language Learning: Architectures and Algorithms","summary":" Large-scale neural language models exhibit a remarkable capacity for\nin-context learning (ICL): they can infer novel functions from datasets\nprovided as input. Most of our current understanding of when and how ICL arises\ncomes from LMs trained on extremely simple learning problems like linear\nregression and associative recall. There remains a significant gap between\nthese model problems and the \"real\" ICL exhibited by LMs trained on large text\ncorpora, which involves not just retrieval and function approximation but\nfree-form generation of language and other structured outputs. In this paper,\nwe study ICL through the lens of a new family of model problems we term\nin-context language learning (ICLL). In ICLL, LMs are presented with a set of\nstrings from a formal language, and must generate additional strings from the\nsame language. We focus on in-context learning of regular languages generated\nby random finite automata.
We evaluate a diverse set of neural sequence models\n(including several RNNs, Transformers, and state-space model variants) on\nregular ICLL tasks, aiming to answer three questions: (1) Which model classes\nare empirically capable of ICLL? (2) What algorithmic solutions do successful\nmodels implement to perform ICLL? (3) What architectural changes can improve\nICLL in less performant models? We first show that Transformers significantly\noutperform neural sequence models with recurrent or convolutional\nrepresentations on ICLL tasks. Next, we provide evidence that their ability to\ndo so relies on specialized \"n-gram heads\" (higher-order variants of induction\nheads) that compute input-conditional next-token distributions. Finally, we\nshow that hard-wiring these heads into recurrent and convolutional models\nimproves performance not just on ICLL, but also on natural language modeling --\nimproving the perplexity of 340M-parameter models by up to 1.14 points (6.7%)\non the SlimPajama dataset.\n","authors":["Ekin Akyürek","Bailin Wang","Yoon Kim","Jacob Andreas"],"pdf_url":"https://arxiv.org/pdf/2401.12973v1.pdf","comment":"29 pages, 8 figures"},{"id":"http://arxiv.org/abs/2401.12972v1","updated":"2024-01-23T18:58:35Z","published":"2024-01-23T18:58:35Z","title":"On the Efficacy of Text-Based Input Modalities for Action Anticipation","summary":" Although the task of anticipating future actions is highly uncertain,\ninformation from additional modalities helps to narrow down plausible action\nchoices. Each modality provides different environmental context for the model\nto learn from. While previous multi-modal methods leverage information from\nmodalities such as video and audio, we primarily explore how text inputs for\nactions and objects can also enable more accurate action anticipation.\nTherefore, we propose a Multi-modal Anticipative Transformer (MAT), an\nattention-based video transformer architecture that jointly learns from\nmulti-modal features and text captions. We train our model in two stages, where\nthe model first learns to predict actions in the video clip by aligning with\ncaptions, and during the second stage, we fine-tune the model to predict future\nactions. Compared to existing methods, MAT has the advantage of learning\nadditional environmental context from two kinds of text inputs: action\ndescriptions during the pre-training stage, and the text inputs for detected\nobjects and actions during modality feature fusion. Through extensive\nexperiments, we evaluate the effectiveness of the pre-training stage, and show\nthat our model outperforms previous methods on all datasets. In addition, we\nexamine the impact of object and action information obtained via text and\nperform extensive ablations. We evaluate the performance on three datasets:\nEpicKitchens-100, EpicKitchens-55 and EGTEA GAZE+, and show that text\ndescriptions do indeed aid in more effective action anticipation.\n","authors":["Apoorva Beedu","Karan Samel","Irfan Essa"],"pdf_url":"https://arxiv.org/pdf/2401.12972v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12963v1","updated":"2024-01-23T18:45:54Z","published":"2024-01-23T18:45:54Z","title":"AutoRT: Embodied Foundation Models for Large Scale Orchestration of\n Robotic Agents","summary":" Foundation models that incorporate language, vision, and more recently\nactions have revolutionized the ability to harness internet scale data to\nreason about useful tasks.
However, one of the key challenges of training\nembodied foundation models is the lack of data grounded in the physical world.\nIn this paper, we propose AutoRT, a system that leverages existing foundation\nmodels to scale up the deployment of operational robots in completely unseen\nscenarios with minimal human supervision. AutoRT leverages vision-language\nmodels (VLMs) for scene understanding and grounding, and further uses large\nlanguage models (LLMs) for proposing diverse and novel instructions to be\nperformed by a fleet of robots. Guiding data collection by tapping into the\nknowledge of foundation models enables AutoRT to effectively reason about\nautonomy tradeoffs and safety while significantly scaling up data collection\nfor robot learning. We demonstrate AutoRT proposing instructions to over 20\nrobots across multiple buildings and collecting 77k real robot episodes via\nboth teleoperation and autonomous robot policies. We experimentally show that\nsuch \"in-the-wild\" data collected by AutoRT is significantly more diverse, and\nthat AutoRT's use of LLMs allows for instruction following data collection\nrobots that can align to human preferences.\n","authors":["Michael Ahn","Debidatta Dwibedi","Chelsea Finn","Montse Gonzalez Arenas","Keerthana Gopalakrishnan","Karol Hausman","Brian Ichter","Alex Irpan","Nikhil Joshi","Ryan Julian","Sean Kirmani","Isabel Leal","Edward Lee","Sergey Levine","Yao Lu","Isabel Leal","Sharath Maddineni","Kanishka Rao","Dorsa Sadigh","Pannag Sanketi","Pierre Sermanet","Quan Vuong","Stefan Welker","Fei Xia","Ted Xiao","Peng Xu","Steve Xu","Zhuo Xu"],"pdf_url":"https://arxiv.org/pdf/2401.12963v1.pdf","comment":"26 pages, 9 figures"},{"id":"http://arxiv.org/abs/2401.12961v1","updated":"2024-01-23T18:45:27Z","published":"2024-01-23T18:45:27Z","title":"Chatterbox: Robust Transport for LLM Token Streaming under Unstable\n Network","summary":" To render each generated token in real time, the LLM server generates\nresponse tokens one by one and streams each generated token (or group of a few\ntokens) through the network to the user right after it is generated, which we\nrefer to as LLM token streaming. However, under unstable network conditions,\nthe LLM token streaming experience could suffer greatly from stalls since one\npacket loss could block the rendering of tokens contained in subsequent packets\neven if they arrive on time. With a real-world measurement study, we show that\ncurrent applications including ChatGPT, Claude, and Bard all suffer from\nincreased stall under unstable network.\n For this emerging token streaming problem in LLM Chatbots, we propose a novel\ntransport layer scheme, called Chatterbox, which puts new generated tokens as\nwell as currently unacknowledged tokens in the next outgoing packet. This\nensures that each packet contains some new tokens and can be independently\nrendered when received, thus avoiding aforementioned stalls caused by missing\npackets. Through simulation under various network conditions, we show\nChatterbox reduces stall ratio (proportion of token rendering wait time) by\n71.0% compared to the token streaming method commonly used by real chatbot\napplications and by 31.6% compared to a custom packet duplication scheme. 
By\ntailoring Chatterbox to fit the token-by-token generation of LLM, we enable the\nChatbots to respond like an eloquent speaker for users to better enjoy\npervasive AI.\n","authors":["Hanchen Li","Yuhan Liu","Yihua Cheng","Siddhant Ray","Kuntai Du","Junchen Jiang"],"pdf_url":"https://arxiv.org/pdf/2401.12961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.00775v2","updated":"2024-01-23T18:31:01Z","published":"2022-06-01T21:37:03Z","title":"Adaptive Local Neighborhood-based Neural Networks for MR Image\n Reconstruction from Undersampled Data","summary":" Recent medical image reconstruction techniques focus on generating\nhigh-quality medical images suitable for clinical use at the lowest possible\ncost and with the fewest possible adverse effects on patients. Recent works\nhave shown significant promise for reconstructing MR images from sparsely\nsampled k-space data using deep learning. In this work, we propose a technique\nthat rapidly estimates deep neural networks directly at reconstruction time by\nfitting them on small adaptively estimated neighborhoods of a training set. In\nbrief, our algorithm alternates between searching for neighbors in a data set\nthat are similar to the test reconstruction, and training a local network on\nthese neighbors followed by updating the test reconstruction. Because our\nreconstruction model is learned on a dataset that is in some sense similar to\nthe image being reconstructed rather than being fit on a large, diverse\ntraining set, it is more adaptive to new scans. It can also handle changes in\ntraining sets and flexible scan settings, while being relatively fast. Our\napproach, dubbed LONDN-MRI, was validated on multiple data sets using deep\nunrolled reconstruction networks. Reconstructions were performed at four fold\nand eight fold undersampling of k-space with 1D variable-density random\nphase-encode undersampling masks. Our results demonstrate that our proposed\nlocally-trained method produces higher-quality reconstructions compared to\nmodels trained globally on larger datasets as well as other scan-adaptive\nmethods.\n","authors":["Shijun Liang","Anish Lahiri","Saiprasad Ravishankar"],"pdf_url":"https://arxiv.org/pdf/2206.00775v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15318v3","updated":"2024-01-23T18:27:30Z","published":"2023-10-23T19:35:57Z","title":"HetGPT: Harnessing the Power of Prompt Tuning in Pre-Trained\n Heterogeneous Graph Neural Networks","summary":" Graphs have emerged as a natural choice to represent and analyze the\nintricate patterns and rich information of the Web, enabling applications such\nas online page classification and social recommendation. The prevailing\n\"pre-train, fine-tune\" paradigm has been widely adopted in graph machine\nlearning tasks, particularly in scenarios with limited labeled nodes. However,\nthis approach often exhibits a misalignment between the training objectives of\npretext tasks and those of downstream tasks. This gap can result in the\n\"negative transfer\" problem, wherein the knowledge gained from pre-training\nadversely affects performance in the downstream tasks. The surge in\nprompt-based learning within Natural Language Processing (NLP) suggests the\npotential of adapting a \"pre-train, prompt\" paradigm to graphs as an\nalternative. However, existing graph prompting techniques are tailored to\nhomogeneous graphs, neglecting the inherent heterogeneity of Web graphs. 
To\nbridge this gap, we propose HetGPT, a general post-training prompting framework\nto improve the predictive performance of pre-trained heterogeneous graph neural\nnetworks (HGNNs). The key is the design of a novel prompting function that\nintegrates a virtual class prompt and a heterogeneous feature prompt, with the\naim to reformulate downstream tasks to mirror pretext tasks. Moreover, HetGPT\nintroduces a multi-view neighborhood aggregation mechanism, capturing the\ncomplex neighborhood structure in heterogeneous graphs. Extensive experiments\non three benchmark datasets demonstrate HetGPT's capability to enhance the\nperformance of state-of-the-art HGNNs on semi-supervised node classification.\n","authors":["Yihong Ma","Ning Yan","Jiayu Li","Masood Mortazavi","Nitesh V. Chawla"],"pdf_url":"https://arxiv.org/pdf/2310.15318v3.pdf","comment":"Accepted to WWW 2024 as research paper"},{"id":"http://arxiv.org/abs/2401.12950v1","updated":"2024-01-23T18:15:58Z","published":"2024-01-23T18:15:58Z","title":"Bayesian Semi-structured Subspace Inference","summary":" Semi-structured regression models enable the joint modeling of interpretable\nstructured and complex unstructured feature effects. The structured model part\nis inspired by statistical models and can be used to infer the input-output\nrelationship for features of particular importance. The complex unstructured\npart defines an arbitrary deep neural network and thereby provides enough\nflexibility to achieve competitive prediction performance. While these models\ncan also account for aleatoric uncertainty, there is still a lack of work on\naccounting for epistemic uncertainty. In this paper, we address this problem by\npresenting a Bayesian approximation for semi-structured regression models using\nsubspace inference. To this end, we extend subspace inference for joint\nposterior sampling from a full parameter space for structured effects and a\nsubspace for unstructured effects. Apart from this hybrid sampling scheme, our\nmethod allows for tunable complexity of the subspace and can capture multiple\nminima in the loss landscape. Numerical experiments validate our approach's\nefficacy in recovering structured effect parameter posteriors in\nsemi-structured models and approaching the full-space posterior distribution of\nMCMC for increasing subspace dimension. Further, our approach exhibits\ncompetitive predictive performance across simulated and real-world datasets.\n","authors":["Daniel Dold","David Rügamer","Beate Sick","Oliver Dürr"],"pdf_url":"https://arxiv.org/pdf/2401.12950v1.pdf","comment":"Accepted at AISTATS 2024"},{"id":"http://arxiv.org/abs/2309.10140v2","updated":"2024-01-23T18:08:34Z","published":"2023-09-18T20:39:12Z","title":"A Geometric Framework for Neural Feature Learning","summary":" We present a novel framework for learning system design based on neural\nfeature extractors. First, we introduce the feature geometry, which unifies\nstatistical dependence and features in the same function space with geometric\nstructures. By applying the feature geometry, we formulate each learning\nproblem as solving the optimal feature approximation of the dependence\ncomponent specified by the learning setting. We propose a nesting technique for\ndesigning learning algorithms to learn the optimal features from data samples,\nwhich can be applied to off-the-shelf network architectures and optimizers. 
To\ndemonstrate the applications of the nesting technique, we further discuss\nmultivariate learning problems, including conditioned inference and multimodal\nlearning, where we present the optimal features and reveal their connections to\nclassical approaches.\n","authors":["Xiangxiang Xu","Lizhong Zheng"],"pdf_url":"https://arxiv.org/pdf/2309.10140v2.pdf","comment":"76 pages, 24 figures"},{"id":"http://arxiv.org/abs/2205.13743v5","updated":"2024-01-23T17:53:23Z","published":"2022-05-27T03:12:18Z","title":"Personalized Algorithmic Recourse with Preference Elicitation","summary":" Algorithmic Recourse (AR) is the problem of computing a sequence of actions\nthat -- once performed by a user -- overturns an undesirable machine decision.\nIt is paramount that the sequence of actions does not require too much effort\nfor users to implement. Yet, most approaches to AR assume that actions cost the\nsame for all users, and thus may recommend unfairly expensive recourse plans to\ncertain users. Prompted by this observation, we introduce PEAR, the first\nhuman-in-the-loop approach capable of providing personalized algorithmic\nrecourse tailored to the needs of any end-user. PEAR builds on insights from\nBayesian Preference Elicitation to iteratively refine an estimate of the costs\nof actions by asking choice set queries to the target user. The queries\nthemselves are computed by maximizing the Expected Utility of Selection, a\nprincipled measure of information gain accounting for uncertainty on both the\ncost estimate and the user's responses. PEAR integrates elicitation into a\nReinforcement Learning agent coupled with Monte Carlo Tree Search to quickly\nidentify promising recourse plans. Our empirical evaluation on real-world\ndatasets highlights how PEAR produces high-quality personalized recourse in\nonly a handful of iterations.\n","authors":["Giovanni De Toni","Paolo Viappiani","Stefano Teso","Bruno Lepri","Andrea Passerini"],"pdf_url":"https://arxiv.org/pdf/2205.13743v5.pdf","comment":"Published in Transactions in Machine Learning Research (TMLR),\n January 2024. See https://openreview.net/forum?id=8sg2I9zXgO for the official\n submission"},{"id":"http://arxiv.org/abs/2401.11488v2","updated":"2024-01-23T17:49:42Z","published":"2024-01-21T13:24:41Z","title":"HARDCORE: H-field and power loss estimation for arbitrary waveforms with\n residual, dilated convolutional neural networks in ferrite cores","summary":" The MagNet Challenge 2023 calls upon competitors to develop data-driven\nmodels for the material-specific, waveform-agnostic estimation of steady-state\npower losses in toroidal ferrite cores. The following HARDCORE (H-field and\npower loss estimation for Arbitrary waveforms with Residual, Dilated\nconvolutional neural networks in ferrite COREs) approach shows that a residual\nconvolutional neural network with physics-informed extensions can serve this\ntask efficiently when trained on observational data beforehand. One key\nsolution element is an intermediate model layer which first reconstructs the bh\ncurve and then estimates the power losses based on the curve's area rendering\nthe proposed topology physically interpretable. In addition, emphasis was\nplaced on expert-based feature engineering and information-rich inputs in order\nto enable a lean model architecture. A model is trained from scratch for each\nmaterial, while the topology remains the same. 
A Pareto-style trade-off between\nmodel size and estimation accuracy is demonstrated, which yields an optimum at\nas low as 1755 parameters and down to below 8\\,\\% for the 95-th percentile of\nthe relative error for the worst-case material with sufficient samples.\n","authors":["Wilhelm Kirchgässner","Nikolas Förster","Till Piepenbrock","Oliver Schweins","Oliver Wallscheid"],"pdf_url":"https://arxiv.org/pdf/2401.11488v2.pdf","comment":"Competition submission version, slightly change author order"},{"id":"http://arxiv.org/abs/2401.12934v1","updated":"2024-01-23T17:42:17Z","published":"2024-01-23T17:42:17Z","title":"Reward-Relevance-Filtered Linear Offline Reinforcement Learning","summary":" This paper studies offline reinforcement learning with linear function\napproximation in a setting with decision-theoretic, but not estimation\nsparsity. The structural restrictions of the data-generating process presume\nthat the transitions factor into a sparse component that affects the reward and\ncould affect additional exogenous dynamics that do not affect the reward.\nAlthough the minimally sufficient adjustment set for estimation of full-state\ntransition properties depends on the whole state, the optimal policy and\ntherefore state-action value function depends only on the sparse component: we\ncall this causal/decision-theoretic sparsity. We develop a method for\nreward-filtering the estimation of the state-action value function to the\nsparse component by a modification of thresholded lasso in least-squares policy\nevaluation. We provide theoretical guarantees for our reward-filtered linear\nfitted-Q-iteration, with sample complexity depending only on the size of the\nsparse component.\n","authors":["Angela Zhou"],"pdf_url":"https://arxiv.org/pdf/2401.12934v1.pdf","comment":"conference version accepted at AISTATS 2024"},{"id":"http://arxiv.org/abs/2401.12930v1","updated":"2024-01-23T17:33:41Z","published":"2024-01-23T17:33:41Z","title":"pyAKI - An Open Source Solution to Automated KDIGO classification","summary":" Acute Kidney Injury (AKI) is a frequent complication in critically ill\npatients, affecting up to 50% of patients in the intensive care units. The lack\nof standardized and open-source tools for applying the Kidney Disease Improving\nGlobal Outcomes (KDIGO) criteria to time series data has a negative impact on\nworkload and study quality. This project introduces pyAKI, an open-source\npipeline addressing this gap by providing a comprehensive solution for\nconsistent KDIGO criteria implementation.\n The pyAKI pipeline was developed and validated using a subset of the Medical\nInformation Mart for Intensive Care (MIMIC)-IV database, a commonly used\ndatabase in critical care research. We defined a standardized data model in\norder to ensure reproducibility. 
Validation against expert annotations\ndemonstrated pyAKI's robust performance in implementing KDIGO criteria.\nComparative analysis revealed its ability to surpass the quality of human\nlabels.\n This work introduces pyAKI as an open-source solution for implementing the\nKDIGO criteria for AKI diagnosis using time series data with high accuracy and\nperformance.\n","authors":["Christian Porschen","Jan Ernsting","Paul Brauckmann","Raphael Weiss","Till Würdemann","Hendrik Booke","Wida Amini","Ludwig Maidowski","Benjamin Risse","Tim Hahn","Thilo von Groote"],"pdf_url":"https://arxiv.org/pdf/2401.12930v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.13209v2","updated":"2024-01-23T17:31:48Z","published":"2022-04-27T21:58:07Z","title":"Robust stabilization of polytopic systems via fast and reliable neural\n network-based approximations","summary":" We consider the design of fast and reliable neural network (NN)-based\napproximations of traditional stabilizing controllers for linear systems with\npolytopic uncertainty, including control laws with variable structure and those\nbased on a (minimal) selection policy. Building upon recent approaches for the\ndesign of reliable control surrogates with guaranteed structural properties, we\ndevelop a systematic procedure to certify the closed-loop stability and\nperformance of a linear uncertain system when a trained rectified linear unit\n(ReLU)-based approximation replaces such traditional controllers. First, we\nprovide a sufficient condition, which involves the worst-case approximation\nerror between ReLU-based and traditional controller-based state-to-input\nmappings, ensuring that the system is ultimately bounded within a set with\nadjustable size and convergence rate. Then, we develop an offline,\nmixed-integer optimization-based method that allows us to compute that quantity\nexactly.\n","authors":["Filippo Fabiani","Paul J. Goulart"],"pdf_url":"https://arxiv.org/pdf/2204.13209v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.03131v2","updated":"2024-01-23T17:29:54Z","published":"2023-11-06T14:28:11Z","title":"Reservoir-Computing Model for Mapping and Forecasting Neuronal\n Interactions from Electrophysiological Data","summary":" The electrophysiological nature of neuronal networks makes it possible to reveal various\ninteractions between different cell units at very short time-scales. One of\nthe many challenges in analyzing these signals is to retrieve the morphology\nand functionality of a given network. In this work, we developed a computational\nmodel, based on the Reservoir Computing Network (RCN) architecture, which decodes\nthe spatio-temporal data from electro-physiological measurements of neuronal\ncultures and reconstructs the network structure on a macroscopic domain,\nrepresenting the connectivity between neuronal units. We demonstrate that the\nmodel can predict the connectivity map of the network with higher accuracy than\ncommon methods such as Cross-Correlation and Transfer-Entropy.
In addition,\nwe experimentally demonstrate the ability of the model to predict a network\nresponse to a specific input, such as localized stimulus.\n","authors":["Ilya Auslender","Giorgio Letti","Yasaman Heydari","Clara Zaccaria","Lorenzo Pavesi"],"pdf_url":"https://arxiv.org/pdf/2311.03131v2.pdf","comment":"Pre-submission draft"},{"id":"http://arxiv.org/abs/2205.05587v3","updated":"2024-01-23T17:26:09Z","published":"2022-05-11T16:00:14Z","title":"Choice of training label matters: how to best use deep learning for\n quantitative MRI parameter estimation","summary":" Deep learning (DL) is gaining popularity as a parameter estimation method for\nquantitative MRI. A range of competing implementations have been proposed,\nrelying on either supervised or self-supervised learning. Self-supervised\napproaches, sometimes referred to as unsupervised, have been loosely based on\nauto-encoders, whereas supervised methods have, to date, been trained on\ngroundtruth labels. These two learning paradigms have been shown to have\ndistinct strengths. Notably, self-supervised approaches have offered lower-bias\nparameter estimates than their supervised alternatives. This result is\ncounterintuitive - incorporating prior knowledge with supervised labels should,\nin theory, lead to improved accuracy. In this work, we show that this apparent\nlimitation of supervised approaches stems from the naive choice of groundtruth\ntraining labels. By training on labels which are deliberately not groundtruth,\nwe show that the low-bias parameter estimation previously associated with\nself-supervised methods can be replicated - and improved on - within a\nsupervised learning framework. This approach sets the stage for a single,\nunifying, deep learning parameter estimation framework, based on supervised\nlearning, where trade-offs between bias and variance are made by careful\nadjustment of training label.\n","authors":["Sean C. Epstein","Timothy J. P. Bray","Margaret Hall-Craggs","Hui Zhang"],"pdf_url":"https://arxiv.org/pdf/2205.05587v3.pdf","comment":"Accepted for publication at the Journal of Machine Learning for\n Biomedical Imaging (MELBA) https://melba-journal.org/2024:002"},{"id":"http://arxiv.org/abs/2401.12926v1","updated":"2024-01-23T17:22:00Z","published":"2024-01-23T17:22:00Z","title":"DsDm: Model-Aware Dataset Selection with Datamodels","summary":" When selecting data for training large-scale models, standard practice is to\nfilter for examples that match human notions of data quality. Such filtering\nyields qualitatively clean datapoints that intuitively should improve model\nbehavior. However, in practice the opposite can often happen: we find that\nselecting according to similarity with \"high quality\" data sources may not\nincrease (and can even hurt) performance compared to randomly selecting data.\n To develop better methods for selecting data, we start by framing dataset\nselection as an optimization problem that we can directly solve for: given\ntarget tasks, a learning algorithm, and candidate data, select the subset that\nmaximizes model performance. This framework thus avoids handpicked notions of\ndata quality, and instead models explicitly how the learning process uses train\ndatapoints to predict on the target tasks. Our resulting method greatly\nimproves language model (LM) performance on both pre-specified tasks and\npreviously unseen tasks. 
Specifically, choosing target tasks representative of\nstandard LM problems and evaluating on diverse held-out benchmarks, our\nselected datasets provide a 2x compute multiplier over baseline methods.\n","authors":["Logan Engstrom","Axel Feldmann","Aleksander Madry"],"pdf_url":"https://arxiv.org/pdf/2401.12926v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12924v1","updated":"2024-01-23T17:20:52Z","published":"2024-01-23T17:20:52Z","title":"Performance Analysis of Support Vector Machine (SVM) on Challenging\n Datasets for Forest Fire Detection","summary":" This article delves into the analysis of performance and utilization of\nSupport Vector Machines (SVMs) for the critical task of forest fire detection\nusing image datasets. With the increasing threat of forest fires to ecosystems\nand human settlements, the need for rapid and accurate detection systems is of\nutmost importance. SVMs, renowned for their strong classification capabilities,\nexhibit proficiency in recognizing patterns associated with fire within images.\nBy training on labeled data, SVMs acquire the ability to identify distinctive\nattributes associated with fire, such as flames, smoke, or alterations in the\nvisual characteristics of the forest area. The document thoroughly examines the\nuse of SVMs, covering crucial elements like data preprocessing, feature\nextraction, and model training. It rigorously evaluates parameters such as\naccuracy, efficiency, and practical applicability. The knowledge gained from\nthis study aids in the development of efficient forest fire detection systems,\nenabling prompt responses and improving disaster management. Moreover, the\ncorrelation between SVM accuracy and the difficulties presented by\nhigh-dimensional datasets is carefully investigated, demonstrated through a\nrevealing case study. The relationship between accuracy scores and the\ndifferent resolutions used for resizing the training datasets has also been\ndiscussed in this article. These comprehensive studies result in a definitive\noverview of the difficulties faced and the potential sectors requiring further\nimprovement and focus.\n","authors":["Ankan Kar","Nirjhar Nath","Utpalraj Kemprai"," Aman"],"pdf_url":"https://arxiv.org/pdf/2401.12924v1.pdf","comment":"19 pages, 8 figures, accepted in IJCNS of SCIRP (not yet published)"},{"id":"http://arxiv.org/abs/2401.12923v1","updated":"2024-01-23T17:20:48Z","published":"2024-01-23T17:20:48Z","title":"Deep multitask neural networks for solving some stochastic optimal\n control problems","summary":" Most existing neural network-based approaches for solving stochastic optimal\ncontrol problems using the associated backward dynamic programming principle\nrely on the ability to simulate the underlying state variables. However, in\nsome problems, this simulation is infeasible, leading to the discretization of\nstate variable space and the need to train one neural network for each data\npoint. This approach becomes computationally inefficient when dealing with\nlarge state variable spaces. In this paper, we consider a class of this type of\nstochastic optimal control problems and introduce an effective solution\nemploying multitask neural networks. 
To train our multitask neural network, we\nintroduce a novel scheme that dynamically balances the learning across tasks.\nThrough numerical experiments on real-world derivatives pricing problems, we\nprove that our method outperforms state-of-the-art approaches.\n","authors":["Christian Yeo"],"pdf_url":"https://arxiv.org/pdf/2401.12923v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2209.07805v4","updated":"2024-01-23T17:14:20Z","published":"2022-09-16T09:09:15Z","title":"A Comprehensive Benchmark for COVID-19 Predictive Modeling Using\n Electronic Health Records in Intensive Care","summary":" The COVID-19 pandemic has posed a heavy burden to the healthcare system\nworldwide and caused huge social disruption and economic loss. Many deep\nlearning models have been proposed to conduct clinical predictive tasks such as\nmortality prediction for COVID-19 patients in intensive care units using\nElectronic Health Record (EHR) data. Despite their initial success in certain\nclinical applications, there is currently a lack of benchmarking results to\nachieve a fair comparison so that we can select the optimal model for clinical\nuse. Furthermore, there is a discrepancy between the formulation of traditional\nprediction tasks and real-world clinical practice in intensive care. To fill\nthese gaps, we propose two clinical prediction tasks, Outcome-specific\nlength-of-stay prediction and Early mortality prediction for COVID-19 patients\nin intensive care units. The two tasks are adapted from the naive\nlength-of-stay and mortality prediction tasks to accommodate the clinical\npractice for COVID-19 patients. We propose fair, detailed, open-source\ndata-preprocessing pipelines and evaluate 17 state-of-the-art predictive models\non two tasks, including 5 machine learning models, 6 basic deep learning models\nand 6 deep learning predictive models specifically designed for EHR data. We\nprovide benchmarking results using data from two real-world COVID-19 EHR\ndatasets. One dataset is publicly available without needing any inquiry and\nanother dataset can be accessed on request. We provide fair, reproducible\nbenchmarking results for two tasks. We deploy all experiment results and models\non an online platform. We also allow clinicians and researchers to upload their\ndata to the platform and get quick prediction results using our trained models.\nWe hope our efforts can further facilitate deep learning and machine learning\nresearch for COVID-19 predictive modeling.\n","authors":["Junyi Gao","Yinghao Zhu","Wenqing Wang","Yasha Wang","Wen Tang","Ewen M. Harrison","Liantao Ma"],"pdf_url":"https://arxiv.org/pdf/2209.07805v4.pdf","comment":"Junyi Gao, Yinghao Zhu and Wenqing Wang contributed equally"},{"id":"http://arxiv.org/abs/2106.01135v4","updated":"2024-01-23T16:52:16Z","published":"2021-06-02T13:05:34Z","title":"MNL-Bandit with Knapsacks: a near-optimal algorithm","summary":" We consider a dynamic assortment selection problem where a seller has a fixed\ninventory of $N$ substitutable products and faces an unknown demand that\narrives sequentially over $T$ periods. In each period, the seller needs to\ndecide on the assortment of products (satisfying certain constraints) to offer\nto the customers. The customer's response follows an unknown multinomial logit\nmodel (MNL) with parameter $\\boldsymbol{v}$. If customer selects product $i \\in\n[N]$, the seller receives revenue $r_i$. 
The goal of the seller is to maximize\nthe total expected revenue from the $T$ customers given the fixed initial\ninventory of $N$ products. We present MNLwK-UCB, a UCB-based algorithm, and\ncharacterize its regret under different regimes of inventory size. We show that\nwhen the inventory size grows quasi-linearly in time, MNLwK-UCB achieves a\n$\tilde{O}(N + \sqrt{NT})$ regret bound. We also show that for a smaller\ninventory (with growth $\sim T^{\alpha}$, $\alpha < 1$), MNLwK-UCB achieves a\n$\tilde{O}(N(1 + T^{\frac{1 - \alpha}{2}}) + \sqrt{NT})$ regret bound. In particular, over a\nlong time horizon $T$, the rate $\tilde{O}(\sqrt{NT})$ is always achieved\nregardless of the constraints and the size of the inventory.\n","authors":["Abdellah Aznag","Vineet Goyal","Noemie Perivier"],"pdf_url":"https://arxiv.org/pdf/2106.01135v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.03311v3","updated":"2024-01-23T16:34:53Z","published":"2023-12-06T06:33:25Z","title":"On the Nystrom Approximation for Preconditioning in Kernel Machines","summary":" Kernel methods are a popular class of nonlinear predictive models in machine\nlearning. Scalable algorithms for learning kernel models need to be iterative\nin nature, but convergence can be slow due to poor conditioning. Spectral\npreconditioning is an important tool to speed up the convergence of such\niterative algorithms for training kernel models. However, computing and storing\na spectral preconditioner can be expensive, which can lead to large\ncomputational and storage overheads, precluding the application of kernel\nmethods to problems with large datasets. A Nystrom approximation of the\nspectral preconditioner is often cheaper to compute and store, and has\ndemonstrated success in practical applications. In this paper we analyze the\ntrade-offs of using such an approximated preconditioner. Specifically, we show\nthat a sample of logarithmic size (as a function of the size of the dataset)\nenables the Nystrom-based approximated preconditioner to accelerate gradient\ndescent nearly as well as the exact preconditioner, while also reducing the\ncomputational and storage overheads.\n","authors":["Amirhesam Abedsoltan","Parthe Pandit","Luis Rademacher","Mikhail Belkin"],"pdf_url":"https://arxiv.org/pdf/2312.03311v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12882v1","updated":"2024-01-23T16:22:50Z","published":"2024-01-23T16:22:50Z","title":"Model-Free $δ$-Policy Iteration Based on Damped Newton Method for\n Nonlinear Continuous-Time H$\infty$ Tracking Control","summary":" This paper presents a {\delta}-PI algorithm, based on the damped Newton\nmethod, for the H{\infty} tracking control problem of an unknown continuous-time\nnonlinear system. A discounted performance function and an augmented system are\nused to get the tracking Hamilton-Jacobi-Isaac (HJI) equation. The tracking HJI\nequation is a nonlinear partial differential equation; traditional\nreinforcement learning methods for solving the tracking HJI equation are mostly\nbased on the Newton method, which usually only satisfies local convergence and\nneeds a good initial guess. Based upon the damped Newton iteration operator\nequation, a generalized tracking Bellman equation is first derived. The\n{\delta}-PI algorithm can seek the optimal solution of the tracking HJI\nequation by iteratively solving the generalized tracking Bellman equation.\nOn-policy learning and off-policy learning {\delta}-PI reinforcement learning\nmethods are provided, respectively.
The off-policy version of the {\delta}-PI algorithm is\na model-free algorithm which can be performed without making use of a priori\nknowledge of the system dynamics. An NN-based implementation scheme for the\noff-policy {\delta}-PI algorithm is shown. The suitability of the model-free\n{\delta}-PI algorithm is illustrated with a nonlinear system simulation.\n","authors":["Qi Wang"],"pdf_url":"https://arxiv.org/pdf/2401.12882v1.pdf","comment":"10 pages, 8 figures"},{"id":"http://arxiv.org/abs/2303.07846v2","updated":"2024-01-23T16:14:46Z","published":"2023-03-14T12:36:01Z","title":"Sample-efficient Adversarial Imitation Learning","summary":" Imitation learning, in which learning is performed by demonstration, has been\nstudied and advanced for sequential decision-making tasks in which a reward\nfunction is not predefined. However, imitation learning methods still require\nnumerous expert demonstration samples to successfully imitate an expert's\nbehavior. To improve sample efficiency, we utilize self-supervised\nrepresentation learning, which can generate vast training signals from the\ngiven data. In this study, we propose a self-supervised representation-based\nadversarial imitation learning method to learn state and action representations\nthat are robust to diverse distortions and temporally predictive, on non-image\ncontrol tasks. In particular, in comparison with existing self-supervised\nlearning methods for tabular data, we propose a different corruption method for\nstate and action representations that is robust to diverse distortions. We\ntheoretically and empirically observe that making an informative feature\nmanifold with less sample complexity significantly improves the performance of\nimitation learning. The proposed method shows a 39% relative improvement over\nexisting adversarial imitation learning methods on MuJoCo in a setting limited\nto 100 expert state-action pairs. Moreover, we conduct comprehensive ablations\nand additional experiments using demonstrations with varying optimality to\nprovide insights into a range of factors.\n","authors":["Dahuin Jung","Hyungyu Lee","Sungroh Yoon"],"pdf_url":"https://arxiv.org/pdf/2303.07846v2.pdf","comment":"Published at JMLR (Journal of Machine Learning Research). A\n preliminary version of this manuscript was presented at Deep RL Workshop,\n NeurIPS 2022"},{"id":"http://arxiv.org/abs/2307.02764v2","updated":"2024-01-23T16:01:02Z","published":"2023-07-06T04:13:57Z","title":"When Does Confidence-Based Cascade Deferral Suffice?","summary":" Cascades are a classical strategy to enable inference cost to vary adaptively\nacross samples, wherein a sequence of classifiers are invoked in turn. A\ndeferral rule determines whether to invoke the next classifier in the sequence,\nor to terminate prediction. One simple deferral rule employs the confidence of\nthe current classifier, e.g., based on the maximum predicted softmax\nprobability. Despite being oblivious to the structure of the cascade -- e.g.,\nnot modelling the errors of downstream models -- such confidence-based deferral\noften works remarkably well in practice. In this paper, we seek to better\nunderstand the conditions under which confidence-based deferral may fail, and\nwhen alternate deferral strategies can perform better. We first present a\ntheoretical characterisation of the optimal deferral rule, which precisely\ncharacterises settings under which confidence-based deferral may suffer.
We\nthen study post-hoc deferral mechanisms, and demonstrate they can significantly\nimprove upon confidence-based deferral in settings where (i) downstream models\nare specialists that only work well on a subset of inputs, (ii) samples are\nsubject to label noise, and (iii) there is distribution shift between the train\nand test set.\n","authors":["Wittawat Jitkrittum","Neha Gupta","Aditya Krishna Menon","Harikrishna Narasimhan","Ankit Singh Rawat","Sanjiv Kumar"],"pdf_url":"https://arxiv.org/pdf/2307.02764v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2401.12866v1","updated":"2024-01-23T16:00:45Z","published":"2024-01-23T16:00:45Z","title":"Evaluating Collaborative and Autonomous Agents in Data-Stream-Supported\n Coordination of Mobile Crowdsourcing","summary":" Mobile crowdsourcing refers to systems where the completion of tasks\nnecessarily requires physical movement of crowdworkers in an on-demand\nworkforce. Evidence suggests that in such systems, tasks often get assigned to\ncrowdworkers who struggle to complete those tasks successfully, resulting in\nhigh failure rates and low service quality. A promising solution to ensure\nhigher quality of service is to continuously adapt the assignment and respond\nto failure-causing events by transferring tasks to better-suited workers who\nuse different routes or vehicles. However, implementing task transfers in\nmobile crowdsourcing is difficult because workers are autonomous and may reject\ntransfer requests. Moreover, task outcomes are uncertain and need to be\npredicted. In this paper, we propose different mechanisms to achieve outcome\nprediction and task coordination in mobile crowdsourcing. First, we analyze\ndifferent data stream learning approaches for the prediction of task outcomes.\nSecond, based on the suggested prediction model, we propose and evaluate two\ndifferent approaches for task coordination with different degrees of autonomy:\nan opportunistic approach for crowdshipping with collaborative, but\nnon-autonomous workers, and a market-based model with autonomous workers for\ncrowdsensing.\n","authors":["Ralf Bruns","Jeremias Dötterl","Jürgen Dunkel","Sascha Ossowski"],"pdf_url":"https://arxiv.org/pdf/2401.12866v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.14391v4","updated":"2024-01-23T15:52:28Z","published":"2023-04-27T17:55:13Z","title":"Energy-based Models are Zero-Shot Planners for Compositional Scene\n Rearrangement","summary":" Language is compositional; an instruction can express multiple relation\nconstraints to hold among objects in a scene that a robot is tasked to\nrearrange. Our focus in this work is an instructable scene-rearranging\nframework that generalizes to longer instructions and to spatial concept\ncompositions never seen at training time. We propose to represent\nlanguage-instructed spatial concepts with energy functions over relative object\narrangements. A language parser maps instructions to corresponding energy\nfunctions and an open-vocabulary visual-language model grounds their arguments\nto relevant objects in the scene. We generate goal scene configurations by\ngradient descent on the sum of energy functions, one per language predicate in\nthe instruction. Local vision-based policies then re-locate objects to the\ninferred goal locations. We test our model on established instruction-guided\nmanipulation benchmarks, as well as benchmarks of compositional instructions we\nintroduce. 
We show our model can execute highly compositional instructions\nzero-shot in simulation and in the real world. It outperforms\nlanguage-to-action reactive policies and Large Language Model planners by a\nlarge margin, especially for long instructions that involve compositions of\nmultiple spatial concepts. Simulation and real-world robot execution videos, as\nwell as our code and datasets are publicly available on our website:\nhttps://ebmplanner.github.io.\n","authors":["Nikolaos Gkanatsios","Ayush Jain","Zhou Xian","Yunchu Zhang","Christopher Atkeson","Katerina Fragkiadaki"],"pdf_url":"https://arxiv.org/pdf/2304.14391v4.pdf","comment":"First two authors contributed equally | RSS 2023"},{"id":"http://arxiv.org/abs/2401.12851v1","updated":"2024-01-23T15:35:50Z","published":"2024-01-23T15:35:50Z","title":"Classification of grapevine varieties using UAV hyperspectral imaging","summary":" The classification of different grapevine varieties is a relevant phenotyping\ntask in Precision Viticulture since it enables estimating the growth of\nvineyard rows dedicated to different varieties, among other applications\nconcerning the wine industry. This task can be performed with destructive\nmethods that require time-consuming tasks, including data collection and\nanalysis in the laboratory. However, Unmanned Aerial Vehicles (UAV) provide a\nmore efficient and less prohibitive approach to collecting hyperspectral data,\ndespite acquiring noisier data. Therefore, the first task is the processing of\nthese data to correct and downsample large amounts of data. In addition, the\nhyperspectral signatures of grape varieties are very similar. In this work, a\nConvolutional Neural Network (CNN) is proposed for classifying seventeen\nvarieties of red and white grape variants. Rather than classifying single\nsamples, these are processed together with their neighbourhood. Hence, the\nextraction of spatial and spectral features is addressed with 1) a spatial\nattention layer and 2) Inception blocks. The pipeline goes from processing to\ndataset elaboration, finishing with the training phase. The fitted model is\nevaluated in terms of response time, accuracy and data separability, and\ncompared with other state-of-the-art CNNs for classifying hyperspectral data.\nOur network was proven to be much more lightweight with a reduced number of\ninput bands, a lower number of trainable weights and therefore, reduced\ntraining time. Despite this, the evaluated metrics showed much better results\nfor our network (~99% overall accuracy), in comparison with previous works\nbarely achieving 81% OA.\n","authors":["Alfonso López","Carlos Javier Ogayar","Francisco Ramón Feito","Joaquim João Sousa"],"pdf_url":"https://arxiv.org/pdf/2401.12851v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12849v1","updated":"2024-01-23T15:33:30Z","published":"2024-01-23T15:33:30Z","title":"Learning safety critics via a non-contractive binary bellman operator","summary":" The inability to naturally enforce safety in Reinforcement Learning (RL),\nwith limited failures, is a core challenge impeding its use in real-world\napplications. One notion of safety of vast practical relevance is the ability\nto avoid (unsafe) regions of the state space. Though such a safety goal can be\ncaptured by an action-value-like function, a.k.a. safety critics, the\nassociated operator lacks the desired contraction and uniqueness properties\nthat the classical Bellman operator enjoys. 
In this work, we overcome the\nnon-contractiveness of safety critic operators by leveraging that safety is a\nbinary property. To that end, we study the properties of the binary safety\ncritic associated with a deterministic dynamical system that seeks to avoid\nreaching an unsafe region. We formulate the corresponding binary Bellman\nequation (B2E) for safety and study its properties. While the resulting\noperator is still non-contractive, we fully characterize its fixed points\nrepresenting--except for a spurious solution--maximal persistently safe regions\nof the state space that can always avoid failure. We provide an algorithm that,\nby design, leverages axiomatic knowledge of safe data to avoid spurious fixed\npoints.\n","authors":["Agustin Castellano","Hancheng Min","Juan Andrés Bazerque","Enrique Mallada"],"pdf_url":"https://arxiv.org/pdf/2401.12849v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18382v2","updated":"2024-01-23T15:27:21Z","published":"2023-10-27T02:58:11Z","title":"From Generative AI to Generative Internet of Things: Fundamentals,\n Framework, and Outlooks","summary":" Generative Artificial Intelligence (GAI) possesses the capabilities of\ngenerating realistic data and facilitating advanced decision-making. By\nintegrating GAI into modern Internet of Things (IoT), Generative Internet of\nThings (GIoT) is emerging and holds immense potential to revolutionize various\naspects of society, enabling more efficient and intelligent IoT applications,\nsuch as smart surveillance and voice assistants. In this article, we present\nthe concept of GIoT and conduct an exploration of its potential prospects.\nSpecifically, we first overview four GAI techniques and investigate promising\nGIoT applications. Then, we elaborate on the main challenges in enabling GIoT\nand propose a general GAI-based secure incentive mechanism framework to address\nthem, in which we adopt Generative Diffusion Models (GDMs) for incentive\nmechanism designs and apply blockchain technologies for secure GIoT management.\nMoreover, we conduct a case study on modern Internet of Vehicle traffic\nmonitoring, which utilizes GDMs to generate effective contracts for\nincentivizing users to contribute sensing data with high quality. Finally, we\nsuggest several open directions worth investigating for the future popularity\nof GIoT.\n","authors":["Jinbo Wen","Jiangtian Nie","Jiawen Kang","Dusit Niyato","Hongyang Du","Yang Zhang","Mohsen Guizani"],"pdf_url":"https://arxiv.org/pdf/2310.18382v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12843v1","updated":"2024-01-23T15:25:21Z","published":"2024-01-23T15:25:21Z","title":"An embedding-based distance for temporal graphs","summary":" We define a distance between temporal graphs based on graph embeddings built\nusing time-respecting random walks. We study both the case of matched graphs,\nwhen there exists a known relation between the nodes, and the unmatched case,\nwhen such a relation is unavailable and the graphs may be of different sizes.\nWe illustrate the interest of our distance definition, using both real and\nsynthetic temporal network data, by showing its ability to discriminate between\ngraphs with different structural and temporal properties. 
Leveraging\nstate-of-the-art machine learning techniques, we propose an efficient\nimplementation of distance computation that is viable for large-scale temporal\ngraphs.\n","authors":["Lorenzo Dall'Amico","Alain Barrat","Ciro Cattuto"],"pdf_url":"https://arxiv.org/pdf/2401.12843v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12842v1","updated":"2024-01-23T15:23:13Z","published":"2024-01-23T15:23:13Z","title":"Iterated Relevance Matrix Analysis (IRMA) for the identification of\n class-discriminative subspaces","summary":" We introduce and investigate the iterated application of Generalized Matrix\nLearning Vector Quantization for the analysis of feature relevances in\nclassification problems, as well as for the construction of\nclass-discriminative subspaces. The suggested Iterated Relevance Matrix\nAnalysis (IRMA) identifies a linear subspace representing the classification-specific\ninformation of the considered data sets using Generalized Matrix\nLearning Vector Quantization (GMLVQ). By iteratively determining a new\ndiscriminative subspace while projecting out all previously identified ones, a\ncombined subspace carrying all class-specific information can be found. This\nfacilitates a detailed analysis of feature relevances, and enables improved\nlow-dimensional representations and visualizations of labeled data sets.\nAdditionally, the IRMA-based class-discriminative subspace can be used for\ndimensionality reduction and the training of robust classifiers with\npotentially improved performance.\n","authors":["Sofie Lövdal","Michael Biehl"],"pdf_url":"https://arxiv.org/pdf/2401.12842v1.pdf","comment":"17 pages, 5 figures, 1 table. Submitted to Neurocomputing. Extension\n of 2023 ESANN conference contribution"},{"id":"http://arxiv.org/abs/2401.11202v2","updated":"2024-01-23T15:11:46Z","published":"2024-01-20T10:30:31Z","title":"PartIR: Composing SPMD Partitioning Strategies for Machine Learning","summary":" Training of modern large neural networks (NN) requires a combination of\nparallelization strategies encompassing data, model, or optimizer sharding.\nWhen strategies increase in complexity, it becomes necessary for partitioning\ntools to be 1) expressive, allowing the composition of simpler strategies, and\n2) predictable to estimate performance analytically. We present PartIR, our\ndesign for a NN partitioning system. PartIR is focused on an incremental\napproach to rewriting and is hardware-and-runtime agnostic. We present a simple\nbut powerful API for composing sharding strategies and a simulator to validate\nthem. The process is driven by high-level programmer-issued partitioning\ntactics, which can be both manual and automatic. Importantly, the tactics are\nspecified separately from the model code, making them easy to change. We\nevaluate PartIR on several different models to demonstrate its predictability,\nexpressibility, and ability to reach peak performance.\n","authors":["Sami Alabed","Bart Chrzaszcz","Juliana Franco","Dominik Grewe","Dougal Maclaurin","James Molloy","Tom Natan","Tamara Norman","Xiaoyue Pan","Adam Paszke","Norman A. 
Rink","Michael Schaarschmidt","Timur Sitdikov","Agnieszka Swietlik","Dimitrios Vytiniotis","Joel Wee"],"pdf_url":"https://arxiv.org/pdf/2401.11202v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12830v1","updated":"2024-01-23T15:07:49Z","published":"2024-01-23T15:07:49Z","title":"Enhancing Next Destination Prediction: A Novel LSTM Approach Using\n Real-World Airline Data","summary":" In the modern transportation industry, accurate prediction of travelers' next\ndestinations brings multiple benefits to companies, such as customer\nsatisfaction and targeted marketing. This study focuses on developing a precise\nmodel that captures the sequential patterns and dependencies in travel data,\nenabling accurate predictions of individual travelers' future destinations. To\nachieve this, a novel model architecture with a sliding window approach based\non Long Short-Term Memory (LSTM) is proposed for destination prediction in the\ntransportation industry. The experimental results highlight satisfactory\nperformance and high scores achieved by the proposed model across different\ndata sizes and performance metrics. This research contributes to advancing\ndestination prediction methods, empowering companies to deliver personalized\nrecommendations and optimize customer experiences in the dynamic travel\nlandscape.\n","authors":["Salih Salihoglu","Gulser Koksal","Orhan Abar"],"pdf_url":"https://arxiv.org/pdf/2401.12830v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.12824v1","updated":"2024-01-23T14:59:46Z","published":"2024-01-23T14:59:46Z","title":"MAPPING: Debiasing Graph Neural Networks for Fair Node Classification\n with Limited Sensitive Information Leakage","summary":" Despite remarkable success in diverse web-based applications, Graph Neural\nNetworks(GNNs) inherit and further exacerbate historical discrimination and\nsocial stereotypes, which critically hinder their deployments in high-stake\ndomains such as online clinical diagnosis, financial crediting, etc. However,\ncurrent fairness research that primarily craft on i.i.d data, cannot be\ntrivially replicated to non-i.i.d. graph structures with topological dependence\namong samples. Existing fair graph learning typically favors pairwise\nconstraints to achieve fairness but fails to cast off dimensional limitations\nand generalize them into multiple sensitive attributes; besides, most studies\nfocus on in-processing techniques to enforce and calibrate fairness,\nconstructing a model-agnostic debiasing GNN framework at the pre-processing\nstage to prevent downstream misuses and improve training reliability is still\nlargely under-explored. Furthermore, previous work on GNNs tend to enhance\neither fairness or privacy individually but few probe into their interplays. In\nthis paper, we propose a novel model-agnostic debiasing framework named MAPPING\n(\\underline{M}asking \\underline{A}nd \\underline{P}runing and\nMessage-\\underline{P}assing train\\underline{ING}) for fair node classification,\nin which we adopt the distance covariance($dCov$)-based fairness constraints to\nsimultaneously reduce feature and topology biases in arbitrary dimensions, and\ncombine them with adversarial debiasing to confine the risks of attribute\ninference attacks. Experiments on real-world datasets with different GNN\nvariants demonstrate the effectiveness and flexibility of MAPPING. 
Our results\nshow that MAPPING can achieve better trade-offs between utility and fairness,\nand mitigate privacy risks of sensitive information leakage.\n","authors":["Ying Song","Balaji Palanisamy"],"pdf_url":"https://arxiv.org/pdf/2401.12824v1.pdf","comment":"Finished May last year. Remember to submit all papers to arXiv early\n without compromising the principles of conferences"},{"id":"http://arxiv.org/abs/2401.12822v1","updated":"2024-01-23T14:55:46Z","published":"2024-01-23T14:55:46Z","title":"Deep Learning Based Simulators for the Phosphorus Removal Process\n Control in Wastewater Treatment via Deep Reinforcement Learning Algorithms","summary":" Phosphorus removal is vital in wastewater treatment to reduce reliance on\nlimited resources. Deep reinforcement learning (DRL) is a machine learning\ntechnique that can optimize complex and nonlinear systems, including the\nprocesses in wastewater treatment plants, by learning control policies through\ntrial and error. However, applying DRL to chemical and biological processes is\nchallenging due to the need for accurate simulators. This study trained six\nmodels to identify the phosphorus removal process and used them to create a\nsimulator for the DRL environment. Although the models achieved high accuracy\n(>97%), uncertainty and incorrect prediction behavior limited their performance\nas simulators over longer horizons. Compounding errors in the models'\npredictions were identified as one of the causes of this problem. This approach\nfor improving process control involves creating simulation environments for DRL\nalgorithms, using data from supervisory control and data acquisition (SCADA)\nsystems with a sufficient historical horizon, without complex system modeling or\nparameter estimation.\n","authors":["Esmaeel Mohammadi","Mikkel Stokholm-Bjerregaard","Aviaja Anna Hansen","Per Halkjær Nielsen","Daniel Ortiz-Arroyo","Petar Durdevic"],"pdf_url":"https://arxiv.org/pdf/2401.12822v1.pdf","comment":"Journal Paper"},{"id":"http://arxiv.org/abs/2401.12820v1","updated":"2024-01-23T14:53:32Z","published":"2024-01-23T14:53:32Z","title":"DatUS^2: Data-driven Unsupervised Semantic Segmentation with Pre-trained\n Self-supervised Vision Transformer","summary":" Successive proposals of several self-supervised training schemes continue to\nemerge, taking one step closer to developing a universal foundation model. In\nthis process, the unsupervised downstream tasks are recognized as one of the\nevaluation methods to validate the quality of visual features learned with a\nself-supervised training scheme. However, unsupervised dense semantic\nsegmentation has not been explored as a downstream task, which can utilize and\nevaluate the quality of semantic information introduced in patch-level feature\nrepresentations during self-supervised training of a vision transformer.\nTherefore, this paper proposes a novel data-driven approach for unsupervised\nsemantic segmentation (DatUS^2) as a downstream task. DatUS^2 generates\nsemantically consistent and dense pseudo-annotated segmentation masks for the\nunlabeled image dataset without using any visual prior or synchronized data. We\ncompare these pseudo-annotated segmentation masks with ground truth masks for\nevaluating recent self-supervised training schemes to learn shared semantic\nproperties at the patch level and discriminative semantic properties at the\nsegment level. Finally, we evaluate existing state-of-the-art self-supervised\ntraining schemes with our proposed downstream task, i.e., DatUS^2. 
Also, the\nbest version of DatUS^2 outperforms the existing state-of-the-art method for\nthe unsupervised dense semantic segmentation task with 15.02% MiOU and 21.47%\nPixel accuracy on the SUIM dataset. It also achieves a competitive level of\naccuracy for a large-scale and complex dataset, i.e., the COCO dataset.\n","authors":["Sonal Kumar","Arijit Sur","Rashmi Dutta Baruah"],"pdf_url":"https://arxiv.org/pdf/2401.12820v1.pdf","comment":"The manuscript contains 13 pages, 9 figures and 7 tables"},{"id":"http://arxiv.org/abs/2401.12819v1","updated":"2024-01-23T14:53:20Z","published":"2024-01-23T14:53:20Z","title":"Dynamic Layer Tying for Parameter-Efficient Transformers","summary":" In the pursuit of reducing the number of trainable parameters in deep\ntransformer networks, we employ Reinforcement Learning to dynamically select\nlayers during training and tie them together. Every few iterations, the RL\nagent is asked whether to train each layer $i$ independently or to copy the\nweights of a previous layer $j